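HttpClient (2) -- Simulating a Browser to Fetch Web Pages
I. Setting the User-Agent request header to simulate a browser
1. When the code from Part 1 is used to request Tuicool, the site returns the following: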
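    网页内容:<!DOCTYPE html>
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
    <p>系统检测亲不是真人行为,因系统资源限制,我们只能拒绝你的请求。如果你有疑问,可以通过微博 http://weibo.com/tuicool2012/ 联系我们。</p>
    </body>
    </html>
(The "网页内容:" prefix means "Page content:". The message in the body reads: "The system has detected that you are not a real person. Because of limited system resources, we can only refuse your request. If you have questions, you can contact us via Weibo at http://weibo.com/tuicool2012/.")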
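This happens because the site restricts crawlers. One way around it is to set the User-Agent request header so that the request looks like it comes from a browser. The code is as follows: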
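    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class UserAgentDemo {

        /**
         * Fetches a web page with a GET request.
         *
         * @param args unused
         * @throws IOException if the request fails or the body cannot be read
         */
        public static void main(String[] args) throws IOException {
            // try-with-resources closes the client and the response,
            // even when an exception is thrown
            try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
                // Create the HttpGet instance
                HttpGet httpGet = new HttpGet("http://www.tuicool.com");
                // Send a browser-like User-Agent header
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
                try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                    HttpEntity entity = response.getEntity();
                    if (entity != null) {
                        // Read the page content
                        String result = EntityUtils.toString(entity, "UTF-8");
                        System.out.println("Page content: " + result);
                    }
                }
            }
        }
    }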
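Setting this header on the HttpGet request is enough, in this case, for the site to treat the request as coming from a browser.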
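User-Agent is an ordinary request header, so the same idea carries over to other HTTP clients. As a minimal sketch, assuming Java 11 or later, the JDK's built-in java.net.http API (swapped in here for Apache HttpClient) can build a request with the same header and inspect it without sending anything. The class name UserAgentRequestSketch and the helper buildGet are hypothetical names used only for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical illustration class; not part of Apache HttpClient.
final class UserAgentRequestSketch {

    // The same Firefox User-Agent string used in the example above.
    static final String FIREFOX_UA =
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0";

    /** Builds (but does not send) a GET request carrying the given User-Agent. */
    static HttpRequest buildGet(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildGet("http://www.tuicool.com", FIREFOX_UA);
        // Show the method, URI and the User-Agent header stored on the request.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```

Building the request separately from sending it makes it easy to check which headers will go out before any network traffic happens.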
II. Getting the response Content-Type
Use entity.getContentType().getValue() to read the Content-Type of the response. The code is as follows:
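    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class ContentTypeDemo {

        public static void main(String[] args) throws IOException {
            // try-with-resources closes the client and the response
            try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
                // Create the HttpGet instance
                HttpGet httpGet = new HttpGet("http://www.tuicool.com");
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
                try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                    HttpEntity entity = response.getEntity();
                    // Both the entity and its Content-Type header may be absent
                    if (entity != null && entity.getContentType() != null) {
                        // Print the Content-Type
                        System.out.println("Content-Type: " + entity.getContentType().getValue());
                    }
                }
            }
        }
    }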
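A Content-Type value such as "text/html; charset=utf-8" often names the character set of the body, which could replace the hard-coded "UTF-8" passed to EntityUtils.toString in the first example. The following is a minimal sketch of pulling that charset out of the header string using only the JDK. The class ContentTypeCharset and the method charsetOf are hypothetical names, and the parsing is deliberately simplified (it does not handle every quoting rule the HTTP specification allows).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; simplified Content-Type parsing.
final class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type value, or the fallback
     * when the value is null, has no charset parameter, or names a charset
     * this JVM does not know.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

The fallback parameter keeps the decision about a default encoding with the caller, since a missing charset parameter is common in practice.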
III. Getting the response status
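200 -- OK (the request succeeded)
403 -- Forbidden (the server refused the request)
500 -- Internal Server Error (the server failed while handling the request)
404 -- Not Found (the page does not exist)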
Use response.getStatusLine().getStatusCode() to get the response status code. The code is as follows:
    import java.io.IOException;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class StatusCodeDemo {

        public static void main(String[] args) throws IOException {
            // try-with-resources closes the client and the response
            try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
                // Create the HttpGet instance
                HttpGet httpGet = new HttpGet("http://www.tuicool.com");
                httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
                try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                    // Read the numeric status code from the status line
                    int state = response.getStatusLine().getStatusCode();
                    System.out.println("Response status: " + state);
                }
            }
        }
    }
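Once the status code is available, a crawler usually branches on it before trying to parse the body. The sketch below shows one way to do that, grouping codes by their first digit as the HTTP specification does. The class StatusCodes and its methods are hypothetical names for illustration, and the int passed in stands for the value returned by response.getStatusLine().getStatusCode().

```java
// Hypothetical helper for illustration; groups HTTP status codes by class.
final class StatusCodes {

    /** Describes a status code by its class (its first digit). */
    static String describe(int code) {
        if (code < 100 || code > 599) {
            return "invalid";
        }
        switch (code / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            default: return "server error";
        }
    }

    /** True when the response is a 2xx and the body is worth parsing. */
    static boolean isSuccess(int code) {
        return code >= 200 && code < 300;
    }

    public static void main(String[] args) {
        // The four codes listed in this section.
        for (int code : new int[] {200, 403, 404, 500}) {
            System.out.println(code + " -- " + describe(code)
                    + (isSuccess(code) ? " (parse body)" : " (skip body)"));
        }
    }
}
```

Checking the class of the code rather than a single value such as 200 also covers other success codes without extra branches.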
IV. HttpClient learning resources