Learning Java: Web Crawlers
0x00 Preface
Having wrapped up the fundamentals, let's write a crawler for practice; there is a lot to learn from it.
0x01 Crawler Structure and Concepts
A more formal name for crawling is data collection; in English such programs are usually called spiders. A crawler is simply a program that automatically harvests data from the Internet.
What a crawler has to do is imitate a normal network request. For example, when you click a link on a website, that is one network request.
It is also worth mentioning what crawlers are good for in penetration testing: for instance, bulk-collecting a site's external links, or the usernames and phone numbers of forum posters. Doing that by hand would be far less efficient.
Overall, the crawler workflow boils down to three steps: request, filtering (that is, data extraction), and finally storing the extracted content. A minimal sketch of this flow is shown below.
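As a rough illustration of that request -> extract -> store loop, here is a minimal sketch using jsoup (introduced in section 0x03). The CSS class and the output file name are assumptions for illustration, not part of the original examples.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CrawlerFlowSketch {
    public static void main(String[] args) throws Exception {
        // 1. Request: fetch the page
        Document doc = Jsoup.connect("https://xz.aliyun.com/?page=1").get();
        // 2. Extract: pull out the post titles (the .topic-title class is an assumption)
        List<String> titles = doc.select(".topic-title").eachText();
        // 3. Store: write the extracted titles to a local file (file name is an assumption)
        Files.write(Paths.get("titles.txt"), titles);
    }
}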
0x02 Crawler Requests
Maven dependency:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.12</version>
</dependency>
Here we will use the XianZhi forum (先知社区) as the demo target.
GET request
package is.text;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class http1get {
    public static void main(String[] args) {
        CloseableHttpClient client = HttpClients.createDefault(); // create the HttpClient instance
        HttpGet httpGet = new HttpGet("https://xz.aliyun.com/?page=1"); // create the GET request object
        CloseableHttpResponse response = null;
        try {
            response = client.execute(httpGet); // send the GET request
            if (response.getStatusLine().getStatusCode() == 200) {
                String s = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(s);
                System.out.println(httpGet);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) { // guard against NPE if the request failed before a response arrived
                    response.close();
                }
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Method notes:
createDefault
public static CloseableHttpClient createDefault()
Creates a CloseableHttpClient instance with the default configuration.
GET request with parameters:
package is.text;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.net.URISyntaxException;

public class http1get {
    public static void main(String[] args) throws URISyntaxException {
        CloseableHttpClient client = HttpClients.createDefault(); // create the HttpClient instance
        URIBuilder uriBuilder = new URIBuilder("https://xz.aliyun.com/"); // build the address with URIBuilder
        uriBuilder.setParameter("page", "2"); // set the query parameter
        HttpGet httpGet = new HttpGet(uriBuilder.build()); // create the GET request object
        // equivalent to requesting https://xz.aliyun.com/?page=2
        CloseableHttpResponse response = null;
        try {
            response = client.execute(httpGet); // send the GET request
            if (response.getStatusLine().getStatusCode() == 200) {
                String s = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(s);
                System.out.println(httpGet);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
POST request
package is.text;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class httppost {
    public static void main(String[] args) {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpPost httpPost = new HttpPost("https://xz.aliyun.com/"); // create the POST request object
        CloseableHttpResponse response = null;
        try {
            response = client.execute(httpPost); // send the POST request
            String s = EntityUtils.toString(response.getEntity());
            System.out.println(s);
            System.out.println(httpPost);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
For requests without parameters, the GET and POST flows are almost identical. The difference is the request object: a GET is created with HttpGet, while a POST is created with HttpPost.
POST request with parameters
package is.text;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class httpparams {
    public static void main(String[] args) throws IOException {
        CloseableHttpClient client = HttpClients.createDefault(); // create the HttpClient instance
        HttpPost httpPost = new HttpPost("https://xz.aliyun.com/"); // create the request object
        List<NameValuePair> params = new ArrayList<NameValuePair>(); // list holding the form parameters
        params.add(new BasicNameValuePair("page", "3"));
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf-8"); // build the form entity from params
        httpPost.setEntity(formEntity); // attach the form body to the POST request
        CloseableHttpResponse response = client.execute(httpPost);
        String s = EntityUtils.toString(response.getEntity());
        System.out.println(s);
        System.out.println(s.length());
        System.out.println(httpPost);
    }
}
Connection pool
Creating a new HttpClient for every request means constantly constructing and tearing down clients; a connection pool avoids this.
Create a connection pool (connection manager) object:
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
Common methods of PoolingHttpClientConnectionManager:
public void setMaxTotal(int max)
Sets the maximum total number of connections in the pool.
public void setDefaultMaxPerRoute(int max)
Sets the maximum number of concurrent connections per host (route).
Common methods of HttpClients:
createDefault()
Creates a CloseableHttpClient instance with the default configuration.
custom()
Creates a builder for constructing a customized CloseableHttpClient instance.
Connection pool example:
package is.text;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class PoolHttpGet {
    public static void main(String[] args) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(100);           // maximum total number of connections
        cm.setDefaultMaxPerRoute(100); // maximum concurrent connections per host
        doGet(cm);
        doGet(cm);
    }

    private static void doGet(PoolingHttpClientConnectionManager cm) {
        // build a client that takes its connections from the shared pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpGet httpGet = new HttpGet("http://www.baidu.com"); // note: the URL needs a scheme
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            String s = EntityUtils.toString(response.getEntity(), "utf-8");
            System.out.println(s.length());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
HttpClient request configuration
package is.text;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class gethttp1params {
    public static void main(String[] args) throws IOException {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)          // maximum time to establish the connection
                .setConnectionRequestTimeout(500) // maximum time to obtain a connection from the manager
                .setSocketTimeout(500)            // maximum time to wait for data (socket read timeout)
                .build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = client.execute(httpGet);
        String s = EntityUtils.toString(response.getEntity());
        System.out.println(s);
    }
}
0x03 Data Extraction
jsoup
jsoup is an HTML parser for Java. It can parse HTML directly from a URL or from a string of HTML, and it offers a very convenient API for finding and manipulating data via the DOM, CSS selectors, and jQuery-like methods.
jsoup's main features:
- Parse HTML from a URL, a file, or a string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text.
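jsoup needs its own dependency. A Maven snippet might look like the following (the version here is just an example; use whatever recent release you prefer):
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>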
Let's write a snippet that crawls the forum's page title:
package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test;

import java.net.URL;

public class JsoupTest1 {
    @Test
    public void testUrl() throws Exception {
        Document doc = Jsoup.parse(new URL("https://xz.aliyun.com/"), 10000); // request URL and timeout in milliseconds
        String title = doc.getElementsByTag("title").first().text(); // get the content of the <title> tag
        System.out.println(title);
    }
}
Here first() returns the first matching element and text() returns the element's text content.
DOM element lookups
getElementById: look up an element by its id
getElementsByTag: look up elements by tag name
getElementsByClass: look up elements by class
getElementsByAttribute: look up elements by attribute
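A short, self-contained sketch of these lookups; the HTML fragment and its id/class/attribute names are made up for illustration:
package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DomLookupSketch {
    public static void main(String[] args) {
        // hypothetical HTML fragment; the id/class/attribute names are assumptions
        String html = "<div id='city_bj' class='city_con' data-x='1'><span>Beijing</span></div>";
        Document doc = Jsoup.parse(html);

        System.out.println(doc.getElementById("city_bj").text());                // by id
        System.out.println(doc.getElementsByTag("span").first().text());         // by tag
        System.out.println(doc.getElementsByClass("city_con").first().text());   // by class
        System.out.println(doc.getElementsByAttribute("data-x").first().text()); // by attribute
    }
}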
Crawling a XianZhi forum article
package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test;

import java.io.IOException;
import java.net.URL;

public class HttpDomTest {
    @Test
    public void TestDom() throws IOException {
        Document doc = Jsoup.parse(new URL("https://xz.aliyun.com/t/8091"), 10000);
        String topic_content = doc.getElementById("topic_content").text();      // post body
        String titile = doc.getElementsByClass("content-title").first().text(); // post title
        System.out.println("title: " + titile);
        System.out.println("topic_content: " + topic_content);
    }
}
Crawling all articles across 10 pages
Getting data out of an element (a short sketch follows this list):
1. Get the element's id: id()
2. Get the element's class name: className()
3. Get the value of an attribute: attr()
4. Get all attributes: attributes()
5. Get the text content: text()
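A minimal sketch of those five getters; the HTML snippet and its attribute values are assumptions:
package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    public static void main(String[] args) {
        // hypothetical element; the id/class/href values are assumptions
        Element el = Jsoup.parse("<a id='link1' class='topic-title' href='/t/8091'>Sample post</a>")
                .selectFirst("a");

        System.out.println(el.id());         // id                    -> link1
        System.out.println(el.className());  // className             -> topic-title
        System.out.println(el.attr("href")); // one attribute's value -> /t/8091
        System.out.println(el.attributes()); // all attributes
        System.out.println(el.text());       // text content          -> Sample post
    }
}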
package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.net.URL;
import java.util.List;

public class HttpDomTest10 {
    @Test
    public void xianzhi10test() throws Exception {
        String url = "https://xz.aliyun.com";
        Document doc = Jsoup.parse(new URL(url), 10000);
        Elements element = doc.getElementsByClass("topic-title"); // every post title link on the page
        List<String> href = element.eachAttr("href");             // collect each link's href
        for (String s : href) {
            try {
                Document requests = Jsoup.parse(new URL(url + s), 100000); // request each article
                String topic_content = requests.getElementById("topic_content").text();
                String titile = requests.getElementsByClass("content-title").first().text();
                System.out.println(titile);
                System.out.println(topic_content);
            } catch (Exception e) {
                System.out.println("Error crawling " + url + s + ": " + e);
            }
        }
    }
}
This successfully crawls every article on one page.
Since one page works, we can simply wrap the logic in a for loop over the page numbers and issue the same requests for each page. That is all it takes to crawl 10 pages:
package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.net.URL;
import java.util.List;

public class HttpdomTEST2 {
    @Test
    public void xianzhi10test() throws Exception {
        String url = "https://xz.aliyun.com/";
        for (int i = 1; i <= 10; i++) { // pages 1 through 10
            String requesturl = "https://xz.aliyun.com/?page=" + i;
            Document doc = Jsoup.parse(new URL(requesturl), 10000);
            Elements element = doc.getElementsByClass("topic-title");
            List<String> href = element.eachAttr("href");
            for (String s : href) {
                try {
                    Document requests = Jsoup.parse(new URL(url + s), 100000);
                    String topic_content = requests.getElementById("topic_content").text();
                    String titile = requests.getElementsByClass("content-title").first().text();
                    System.out.println(titile);
                    System.out.println(topic_content);
                } catch (Exception e) {
                    System.out.println("Error crawling " + url + s + ": " + e);
                }
            }
        }
    }
}
The crawler takes the links found on one page and then requests each of them. If every page holds a dozen or so articles, crawling ten pages adds up to a lot of requests, and a single thread is nowhere near enough; we need multiple threads to crawl the data.
Multithreaded crawling with a configurable thread count and page count
The Runnable implementation:
package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.URL;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Climbimpl implements Runnable {
    private String url;
    private int pages;
    Lock lock = new ReentrantLock();

    public Climbimpl(String url, int pages) {
        this.url = url;
        this.pages = pages;
    }

    public void run() {
        lock.lock(); // note: this lock lets only one thread run the loop at a time
        System.out.println(this.pages);
        for (int i = 1; i <= this.pages; i++) { // pages 1 through pages
            try {
                String requesturl = this.url + "?page=" + i;
                Document doc = Jsoup.parse(new URL(requesturl), 10000);
                Elements element = doc.getElementsByClass("topic-title");
                List<String> href = element.eachAttr("href");
                for (String s : href) {
                    try {
                        Document requests = Jsoup.parse(new URL(this.url + s), 100000);
                        String topic_content = requests.getElementById("topic_content").text();
                        String titile = requests.getElementsByClass("content-title").first().text();
                        System.out.println(titile);
                        System.out.println(topic_content);
                    } catch (Exception e) {
                        System.out.println("Error crawling " + this.url + s + ": " + e);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        lock.unlock();
    }
}
main:
package Jsoup;

public class TestClimb {
    public static void main(String[] args) {
        int Threadlist_num = 50;               // number of threads
        String url = "https://xz.aliyun.com/"; // base URL
        int pages = 10;                        // number of pages to crawl
        Climbimpl climbimpl = new Climbimpl(url, pages);
        for (int i = 0; i < Threadlist_num; i++) {
            new Thread(climbimpl).start();
        }
    }
}
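One thing to note about this design: all 50 threads share the same Runnable, and run() holds a single ReentrantLock for its whole body, so the threads end up executing one after another and each of them crawls every page. A common alternative, sketched below purely as a variation under the same assumptions about the page structure, is to hand one page per task to a fixed-size thread pool:
package Jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.net.URL;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolClimb {
    public static void main(String[] args) {
        String url = "https://xz.aliyun.com/";
        int pages = 10;
        ExecutorService pool = Executors.newFixedThreadPool(5); // 5 worker threads (an assumed size)

        for (int i = 1; i <= pages; i++) {
            final int page = i;
            pool.submit(() -> {
                try {
                    // each task crawls exactly one listing page
                    Document doc = Jsoup.parse(new URL(url + "?page=" + page), 10000);
                    Elements titles = doc.getElementsByClass("topic-title");
                    List<String> hrefs = titles.eachAttr("href");
                    for (String s : hrefs) {
                        Document post = Jsoup.parse(new URL(url + s), 10000);
                        System.out.println(post.getElementsByClass("content-title").first().text());
                        System.out.println(post.getElementById("topic_content").text());
                    }
                } catch (Exception e) {
                    System.out.println("Error crawling page " + page + ": " + e);
                }
            });
        }
        pool.shutdown(); // stop accepting new tasks; queued tasks still run to completion
    }
}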
Select selectors
tagname: find elements by tag name, e.g. span
#id: find an element by id, e.g. #city_bj
.class: find elements by class name, e.g. .class_a
[attribute]: find elements that carry an attribute, e.g. [abc]
[attr=value]: find elements by attribute value, e.g. [class=s_name]
Code examples:
Find elements by tag:
Elements span = document.select("span");
Find an element by id:
String str = document.select("#city_bj").text();
Find elements by class name:
str = document.select(".class_a").text();
Find elements by attribute:
str = document.select("[abc]").text();
Find elements by attribute value:
str = document.select("[class=s_name]").text();
Tag plus attribute:
str = document.select("span[abc]").text();
Any combination:
str = document.select("span[abc].s_name").text();
Find specific direct children of a parent element:
str = document.select(".city_con > ul > li").text();
Find all direct children of a parent element:
str = document.select(".city_con > *").text();
0x04 Wrap-up
The crawler in this post leans on HttpClient for sending requests and on jsoup for parsing; jsoup bundles just about everything a basic crawler needs, from fetching pages to extracting data.