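Scraping paid Baidu Wenku content loaded asynchronously by JS, with Java and Selenium
1. Analysis
Test site: Baidu Wenku
Open the page, press F12 to open the developer tools, and search the network requests for 0.json.
Baidu Wenku loads its document data asynchronously through JS, so pointing Selenium at the document page and reading the page source does not return the content.
Opening one of the 0.json requests shows that the document content is in its response.
Copy the URL of that request; requesting it directly returns the content as JSON.
So we start from this URL and then filter the data it returns.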
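The filtering step mostly amounts to stripping the wrapper around the JSON. The code later in this post removes a short prefix and a trailing ")" from the response, which suggests the body is a callback-style wrapper such as name({...}). The following is a minimal sketch of that unwrapping using only the JDK; the class name JsonpUnwrap and the callback name wenku_1 are made up for illustration and are not taken from the Baidu Wenku response.

```java
class JsonpUnwrap {
    /**
     * Returns the text between the first '(' and the last ')'.
     * Input that already looks like bare JSON, or that has no
     * parenthesis pair, is returned trimmed but otherwise unchanged.
     */
    public static String unwrap(String body) {
        String trimmed = body.trim();
        if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
            return trimmed; // already bare JSON, nothing to strip
        }
        int open = trimmed.indexOf('(');
        int close = trimmed.lastIndexOf(')');
        if (open < 0 || close <= open) {
            return trimmed; // no wrapper found
        }
        return trimmed.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        // "wenku_1" is a hypothetical callback name used only for this demo.
        System.out.println(unwrap("wenku_1({\"body\":[{\"c\":\"text\"}]})"));
    }
}
```

Cutting at the first "(" and the last ")" does not depend on the length of the callback name, which a fixed-length prefix removal would.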
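2. Implementation
The code below is given in both Kotlin and Java.
Kotlin version:
The dependencies are listed below. I use Gradle; if you use Maven, search https://mvnrepository.com/ for the corresponding coordinates. The code also imports com.alibaba.fastjson, so the com.alibaba:fastjson artifact has to be on the classpath as well.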
implementation("org.jsoup:jsoup:1.14.3")
implementation("org.seleniumhq.selenium:selenium-java:4.0.0")
The main scraping code:
import com.alibaba.fastjson.JSON
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.firefox.FirefoxOptions
import org.openqa.selenium.firefox.FirefoxProfile

/**
 * @author 没有梦想的java菜鸟
 * @date 2022/03/07 10:12 AM
 */
class BaiduDocument {
    fun getInfo(url: String): String {
        System.setProperty("webdriver.gecko.driver", "/usr/local/bin/geckodriver")
        val options = FirefoxOptions()
        val profile = FirefoxProfile()
        // Disable GPU rendering (Chromium-style switch; Firefox may ignore it)
        options.addArguments("--disable-gpu")
        options.addArguments("--headless")
        // Ignore certificate errors
        options.setAcceptInsecureCerts(true)
        // Hide the "browser is being automated" notice (Chromium-style switch)
        options.addArguments("--disable-infobars")
        // Anti-bot key point: make window.navigator.webdriver report false
        options.addPreference("dom.webdriver.enabled", false)
        // Override the User-Agent; Firefox reads it from general.useragent.override
        profile.setPreference(
            "general.useragent.override",
            "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19"
        )
        // The profile only takes effect once it is attached to the options
        options.setProfile(profile)
        val driver = FirefoxDriver(options)
        try {
            driver.get(url)
            // Firefox wraps the plain-text response in <html>...<pre>, and the response
            // itself is a callback(...) wrapper, so keep only the text between the
            // first "(" and the last ")"
            val json = driver.pageSource.substringAfter("(").substringBeforeLast(")")
            // Parse the JSON
            val array = JSON.parseObject(json).getJSONArray("body")
            val sb = StringBuilder()
            for (i in 0 until array.size) {
                val c = array.getJSONObject(i).getString("c") ?: continue
                // A fragment that is a single space marks a line break
                if (c == " ") {
                    sb.append("\r\n")
                }
                sb.append(c)
            }
            return sb.toString()
        } finally {
            // quit() ends the browser session even if parsing fails
            driver.quit()
        }
    }
}
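The formatting rule in the loop above (a fragment that is exactly one space starts a new line, and every fragment is then appended as-is) can be tried out without a browser or a JSON library. The sketch below applies the same rule to a plain list of strings; FragmentJoiner and the sample fragments are hypothetical and only stand in for the "c" values of the "body" array.

```java
import java.util.Arrays;
import java.util.List;

class FragmentJoiner {
    /** Joins text fragments, inserting CRLF before every single-space fragment. */
    public static String join(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String c : fragments) {
            if (c == null) {
                continue; // skip entries that carry no text
            }
            if (" ".equals(c)) {
                sb.append("\r\n"); // a lone space marks a line break
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up fragments standing in for the "c" fields of the response.
        System.out.println(join(Arrays.asList("Title", " ", "First paragraph")));
    }
}
```

Note that the space itself is still appended after the line break, so each new line begins with one space, exactly as in the scraping code.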
Test class
import java.util.Properties

// Empty class, used only to obtain a class loader for reading config.properties
class TestInfoa

fun main() {
    val properties = Properties()
    properties.load(TestInfoa::class.java.classLoader.getResourceAsStream("config.properties"))
    val sb = StringBuilder()
    val document = BaiduDocument()
    // Properties is an unordered map, so sort the entries by key before fetching.
    // Keys are compared as strings, which keeps wk1..wk9 in order.
    properties.entries.sortedBy { it.key as String }.forEach {
        sb.append(document.getInfo(it.value as String))
    }
    println(sb.toString())
}
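Sorting the keys as plain strings works for wk1 through wk9, but a key such as wk10 would sort before wk2. If a document needs more than nine URLs, one option is to order the keys by their numeric suffix instead. This is a sketch under the assumption that keys follow the wkN pattern used in config.properties; the class KeyOrder is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

class KeyOrder {
    /** Returns the trailing digits of a key as a number, or 0 if there are none. */
    public static int suffixNumber(String key) {
        int i = key.length();
        while (i > 0 && Character.isDigit(key.charAt(i - 1))) {
            i--;
        }
        String digits = key.substring(i);
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }

    /** Sorts keys by numeric suffix first, then alphabetically as a tie-breaker. */
    public static List<String> sortKeys(Collection<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Comparator<String> byNumber = Comparator.comparingInt(KeyOrder::suffixNumber);
        sorted.sort(byNumber.thenComparing(Comparator.naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortKeys(Arrays.asList("wk10", "wk2", "wk1")));
    }
}
```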
Configuration file config.properties
(it contains the request URLs)
Note: these JSON request URLs expire. If a request returns 403, refresh the page and copy the new URLs.
wk1=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=0-16174&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.zBNcrkhGO26YZsNYFW1IJJWSHx7UNjQVRJdakhBh3Fw%3D.1646637011
wk2=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=16175-31688&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.axsURrcQpJq0VV4chU0dH3GBxecoDG0M4%2FIr9MiV7tA%3D.1646637011
wk3=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=31689-49449&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.4gN8w%2Fflo8UlovBIbodkbgYCpM05hZ6wtmg36WEzSkU%3D.1646637011
wk4=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=49450-68119&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.PTyjuESx%2FCWe7oAxyy8tJeFy0ZisjCGBv9bFNYvkbV4%3D.1646637011
wk5=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=68120-86618&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.bVYGAXdfEI51GhCBR6smBSLyiYPXGKoJYydJr82lzp4%3D.1646637011
wk6=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=86619-104020&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.2Qhzql9I9mYXRMkVNq%2Bl1uOjNRWuUV9Qh1Fds9itao8%3D.1646637011
wk7=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A57%3A05%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A57%3A05Z%2F3600%2Fhost%2Fe5cd9579aadf9f4cd9aae8abcb452c8fd8481c2a9b9922b951ff89d655fe6f70&x-bce-range=104021-122632&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzOTgyNSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.3E3HLX%2BJaQnd2uBcrp9GTpD6GmR2Rmk8pxWrhnMf0oM%3D.1646639825
wk8=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A57%3A05%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A57%3A05Z%2F3600%2Fhost%2Fe5cd9579aadf9f4cd9aae8abcb452c8fd8481c2a9b9922b951ff89d655fe6f70&x-bce-range=122633-140352&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzOTgyNSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.KoqzmcPtdQFDJaNU0u3nMZd4GvNtAE1LTaZMmOD%2BgTQ%3D.1646639825
wk9=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A57%3A05%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A57%3A05Z%2F3600%2Fhost%2Fe5cd9579aadf9f4cd9aae8abcb452c8fd8481c2a9b9922b951ff89d655fe6f70&x-bce-range=140353-160852&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzOTgyNSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.0DCycDQu3KcfDzjHem2qG3v%2FvBS2vlYmSOjzlLEMhBE%3D.1646639825
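Since the URLs expire, it can help to check a URL before fetching it rather than waiting for a 403. In the sample URLs above, the decoded authorization parameter has the shape bce-auth-v1/key/timestamp/seconds/..., for example a timestamp of 2022-03-07T06:10:11Z followed by 3600. The sketch below assumes, purely from that observed shape, that the third field is the signing time and the fourth is the validity period in seconds; SignedUrlExpiry is a hypothetical helper, not part of any Baidu SDK.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

class SignedUrlExpiry {
    /** Computes signing time + validity seconds from the authorization parameter. */
    public static Instant expiresAt(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            throw new IllegalArgumentException("URL has no query string");
        }
        for (String param : url.substring(q + 1).split("&")) {
            if (param.startsWith("authorization=")) {
                String raw = param.substring("authorization=".length());
                String value = URLDecoder.decode(raw, StandardCharsets.UTF_8);
                // Assumed layout: version / key / timestamp / seconds / ...
                String[] parts = value.split("/");
                if (parts.length < 4) {
                    throw new IllegalArgumentException("unexpected authorization layout");
                }
                return Instant.parse(parts[2]).plusSeconds(Long.parseLong(parts[3]));
            }
        }
        throw new IllegalArgumentException("no authorization parameter");
    }

    /** True once the given moment has reached or passed the computed expiry. */
    public static boolean isExpired(String url, Instant now) {
        return !now.isBefore(expiresAt(url));
    }

    public static void main(String[] args) {
        // Shortened, made-up URL that keeps only the shape of the authorization parameter.
        String url = "https://example.com/0.json?authorization=bce-auth-v1%2Fak"
                + "%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fsig&x-bce-range=0-16174";
        System.out.println(expiresAt(url));
        System.out.println(isExpired(url, Instant.now()));
    }
}
```

A check like this could run over every entry of config.properties before starting the browser, so that stale URLs are refreshed in one pass.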
Java version
Maven dependencies (the code below also imports com.alibaba.fastjson, which has to be added as well)
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.0.0</version>
</dependency>
Scraping code
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.firefox.FirefoxProfile;

public class BaiduDocument {
    public static String getInfo(String url) {
        System.setProperty("webdriver.gecko.driver", "/usr/local/bin/geckodriver");
        FirefoxOptions options = new FirefoxOptions();
        FirefoxProfile profile = new FirefoxProfile();
        // Disable GPU rendering (Chromium-style switch; Firefox may ignore it)
        options.addArguments("--disable-gpu");
        options.addArguments("--headless");
        // Ignore certificate errors
        options.setAcceptInsecureCerts(true);
        // Hide the "browser is being automated" notice (Chromium-style switch)
        options.addArguments("--disable-infobars");
        // Anti-bot key point: make window.navigator.webdriver report false
        options.addPreference("dom.webdriver.enabled", false);
        // Override the User-Agent; Firefox reads it from general.useragent.override
        profile.setPreference(
                "general.useragent.override",
                "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19"
        );
        // The profile only takes effect once it is attached to the options
        options.setProfile(profile);
        FirefoxDriver driver = new FirefoxDriver(options);
        try {
            driver.get(url);
            // Firefox wraps the plain-text response in <html>...<pre>, and the response
            // itself is a callback(...) wrapper, so keep only the text between the
            // first "(" and the last ")"
            String source = driver.getPageSource();
            String jsonStr = source.substring(source.indexOf('(') + 1, source.lastIndexOf(')'));
            // Parse the JSON
            JSONArray array = JSON.parseObject(jsonStr).getJSONArray("body");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < array.size(); i++) {
                String c = array.getJSONObject(i).getString("c");
                if (c == null) {
                    continue;
                }
                // A fragment that is a single space marks a line break
                if (" ".equals(c)) {
                    sb.append("\r\n");
                }
                sb.append(c);
            }
            return sb.toString();
        } finally {
            // quit() ends the browser session even if parsing fails
            driver.quit();
        }
    }
}
Test code
import java.io.IOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class Test {
    public static void main(String[] args) throws IOException {
        Properties properties = new Properties();
        properties.load(Test.class.getClassLoader().getResourceAsStream("config.properties"));
        StringBuilder sb = new StringBuilder();
        // Copy into a TreeMap so the URLs are visited in key order (wk1, wk2, ...)
        Map<String, String> map = new TreeMap<>();
        properties.forEach((k, v) -> map.put(k.toString(), v.toString()));
        // getInfo is static, so call it on the class rather than on an instance
        map.values().forEach(url -> sb.append(BaiduDocument.getInfo(url)));
        System.out.println(sb);
    }
}
Configuration file config.properties
(it contains the request URLs)
Note: these JSON request URLs expire. If a request returns 403, refresh the page and copy the new URLs.
wk1=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2016%3A55%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T08%3A55%3A11Z%2F3600%2Fhost%2F64757b750d644baa47f34171b6cb9ea3d2528906584e210ba1034af283917d28&x-bce-range=68120-86618&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjY0NjkxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.K0FaDDVBUkf15ipbr1tCL9nqo3A6n81ZuXWl2dF0qnk%3D.1646646911
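3. Summary
Baidu Wenku loads and refreshes its data asynchronously. To get the remaining JSON requests, click "Continue reading" and scroll down the page; new JSON requests then appear in the network panel automatically.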