Scraping JS-asynchronously-loaded content from paid Baidu Wenku documents with Java and Selenium

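1. Analysis

Test site: Baidu Wenku (百度文库)

Open the page, press F12 to open the developer tools, and search the network requests for 0.json. Baidu Wenku loads the document data asynchronously through JS, so scraping the page source directly with Selenium does not return the data.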

Open one of these requests and you will find that the content is all in its response.

Then locate the request URL; requesting it returns the content as JSON.

So we start from this URL, fetch the data, and then filter it down to the text we want.
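
As a rough sketch of what "filtering" involves: the request URL asks for responseContentType=application/javascript, and the code later in this post strips a prefix and a trailing ")" before parsing, which suggests the JSON arrives wrapped in a callback call. The class name JsonpUnwrap and the callback name wenku_1 below are made up for illustration; only the general shape callback(...) is assumed.

```java
/** Sketch: strip a JSONP-style callback wrapper, leaving the bare JSON text. */
final class JsonpUnwrap {

    /**
     * Returns the text between the first '(' and the last ')'.
     * Input that already looks like bare JSON, or that has no such pair,
     * is returned trimmed and otherwise unchanged.
     */
    static String unwrap(String body) {
        String trimmed = body.trim();
        // Bare JSON starts with '{' or '['; leave it alone
        if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
            return trimmed;
        }
        int open = trimmed.indexOf('(');
        int close = trimmed.lastIndexOf(')');
        if (open < 0 || close < open) {
            return trimmed;
        }
        return trimmed.substring(open + 1, close);
    }

    public static void main(String[] args) {
        // "wenku_1" is a hypothetical callback name used only for this demo
        String sample = "wenku_1({\"body\":[{\"c\":\"Hello\"}]})";
        System.out.println(unwrap(sample));
    }
}
```

Locating the parentheses, rather than cutting a fixed number of characters, keeps the step working if the callback name changes length.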

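2. Code implementation

Below are implementations in Kotlin and in Java.

Kotlin version:

The dependencies are listed below. I use Gradle; if you use Maven, search for the corresponding coordinates in the Maven repository at https://mvnrepository.com/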

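implementation("org.jsoup:jsoup:1.14.3")
implementation("org.seleniumhq.selenium:selenium-java:4.0.0")
// Needed for the com.alibaba.fastjson imports used in the code below
implementation("com.alibaba:fastjson:1.2.83")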

The main scraping code is as follows

import com.alibaba.fastjson.JSON
import org.openqa.selenium.firefox.FirefoxDriver
import org.openqa.selenium.firefox.FirefoxOptions

/**
 * @author 没有梦想的java菜鸟
 * @date 2022/03/07 10:12 AM
 */
class BaiduDocument {
    fun getInfo(url: String): String {
        System.setProperty("webdriver.gecko.driver", "/usr/local/bin/geckodriver")
        val options = FirefoxOptions()
        // Run Firefox without a visible window
        options.addArguments("--headless")
        // Ignore certificate errors
        options.setAcceptInsecureCerts(true)
        // Key anti-bot setting: intended to make window.navigator.webdriver report false
        options.addPreference("dom.webdriver.enabled", false)
        // Override the User-Agent request header
        options.addPreference(
            "general.useragent.override",
            "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19"
        )
        val driver = FirefoxDriver(options)
        try {
            driver.get(url)
            // Firefox renders the plain-text response inside an HTML page.
            // Strip that HTML, then strip the callback(...) wrapper to leave bare JSON.
            val str = driver.pageSource
                .replace("<html><head><link rel=\"stylesheet\" href=\"resource://content-accessible/plaintext.css\"></head><body><pre>", "")
                .replace("</pre></body></html>", "")
                .substringAfter("(")
                .substringBeforeLast(")")
            // Parse the JSON
            val array = JSON.parseObject(str).getJSONArray("body")
            val sb = StringBuilder()
            for (i in 0 until array.size) {
                val c = array.getJSONObject(i).getString("c")
                // A fragment that is a single space marks a line break, so emit a newline first
                if (c == " ") {
                    sb.append("\r\n")
                }
                sb.append(c)
            }
            return sb.toString()
        } finally {
            // quit() ends the whole browser session, not just the current window
            driver.quit()
        }
    }
}

Test class

import java.util.Properties

// Empty class, used only to obtain a class loader for reading the resource file
class TestInfoa

fun main() {
    val properties = Properties()
    properties.load(TestInfoa::class.java.classLoader.getResourceAsStream("config.properties"))
    val sb = StringBuilder()
    val document = BaiduDocument()
    // Properties is an unordered map, so convert it to a list and sort by key.
    // Keys compare as strings: "wk10" would sort before "wk2", so zero-pad beyond nine entries.
    properties.toList()
        .sortedBy { it.first as String }
        .forEach { sb.append(document.getInfo(it.second as String)) }
    println(sb.toString())
}
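
Sorting by key name depends on how the keys were typed into the file. Each sample URL in this post also carries an x-bce-range parameter (0-16174, 16175-31688, and so on), so one alternative is to order the URLs by the start of that range. The sketch below assumes that ascending range order matches document order, which the sample URLs suggest but which is not confirmed; RangeOrder and example.com are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: order request URLs by the start offset in their x-bce-range parameter. */
final class RangeOrder {

    private static final Pattern RANGE_START = Pattern.compile("[?&]x-bce-range=(\\d+)-");

    /** Start offset of the range, or Long.MAX_VALUE when the parameter is absent. */
    static long rangeStart(String url) {
        Matcher m = RANGE_START.matcher(url);
        return m.find() ? Long.parseLong(m.group(1)) : Long.MAX_VALUE;
    }

    /** Returns a new list sorted by range start; the input list is not modified. */
    static List<String> sortByRange(List<String> urls) {
        List<String> sorted = new ArrayList<>(urls);
        sorted.sort(Comparator.comparingLong(RangeOrder::rangeStart));
        return sorted;
    }

    public static void main(String[] args) {
        // Shortened placeholder URLs; real ones carry more parameters
        List<String> urls = Arrays.asList(
                "https://example.com/0.json?x-bce-range=16175-31688&token=b",
                "https://example.com/0.json?x-bce-range=0-16174&token=a");
        sortByRange(urls).forEach(System.out::println);
    }
}
```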

Configuration file config.properties (it holds the request URLs)

Note: these JSON request URLs expire. If a request returns 403, refresh the page and copy the latest URLs.

wk1=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=0-16174&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.zBNcrkhGO26YZsNYFW1IJJWSHx7UNjQVRJdakhBh3Fw%3D.1646637011
wk2=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=16175-31688&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.axsURrcQpJq0VV4chU0dH3GBxecoDG0M4%2FIr9MiV7tA%3D.1646637011
wk3=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=31689-49449&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.4gN8w%2Fflo8UlovBIbodkbgYCpM05hZ6wtmg36WEzSkU%3D.1646637011
wk4=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=49450-68119&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.PTyjuESx%2FCWe7oAxyy8tJeFy0ZisjCGBv9bFNYvkbV4%3D.1646637011
wk5=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=68120-86618&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.bVYGAXdfEI51GhCBR6smBSLyiYPXGKoJYydJr82lzp4%3D.1646637011
wk6=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A10%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2Fa44def08601226403491f5d303ff1c1cd35837bf3292c4c30ffd1ce5f7859c5d&x-bce-range=86619-104020&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzNzAxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.2Qhzql9I9mYXRMkVNq%2Bl1uOjNRWuUV9Qh1Fds9itao8%3D.1646637011
wk7=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A57%3A05%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A57%3A05Z%2F3600%2Fhost%2Fe5cd9579aadf9f4cd9aae8abcb452c8fd8481c2a9b9922b951ff89d655fe6f70&x-bce-range=104021-122632&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzOTgyNSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.3E3HLX%2BJaQnd2uBcrp9GTpD6GmR2Rmk8pxWrhnMf0oM%3D.1646639825
wk8=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A57%3A05%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A57%3A05Z%2F3600%2Fhost%2Fe5cd9579aadf9f4cd9aae8abcb452c8fd8481c2a9b9922b951ff89d655fe6f70&x-bce-range=122633-140352&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzOTgyNSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.KoqzmcPtdQFDJaNU0u3nMZd4GvNtAE1LTaZMmOD%2BgTQ%3D.1646639825
wk9=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2014%3A57%3A05%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T06%3A57%3A05Z%2F3600%2Fhost%2Fe5cd9579aadf9f4cd9aae8abcb452c8fd8481c2a9b9922b951ff89d655fe6f70&x-bce-range=140353-160852&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjYzOTgyNSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.0DCycDQu3KcfDzjHem2qG3v%2FvBS2vlYmSOjzlLEMhBE%3D.1646639825
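
To see in advance whether a URL is likely to return 403, one option is to read the expiry out of the URL itself. In the sample URLs the decoded authorization parameter has the form bce-auth-v1/<key>/<timestamp>/<seconds>/..., for example 2022-03-07T06:10:11Z followed by 3600. The sketch below assumes the third field is the signing time and the fourth is the validity period in seconds. The samples are consistent with that reading: 06:10:11Z plus 3600 seconds is 07:10:11Z, which equals the Unix timestamp 1646637011 at the end of the token parameter. UrlExpiry, KEY, SIG and example.com are placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: estimate when a signed request URL stops being valid. */
final class UrlExpiry {

    private static final Pattern AUTH = Pattern.compile("[?&]authorization=([^&]+)");

    /** Signing time plus validity period, or empty if the URL does not match the assumed layout. */
    static Optional<Instant> expiresAt(String url) {
        Matcher m = AUTH.matcher(url);
        if (!m.find()) {
            return Optional.empty();
        }
        String decoded;
        try {
            decoded = URLDecoder.decode(m.group(1), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        // Assumed layout: version / key / timestamp / seconds / ...
        String[] parts = decoded.split("/");
        if (parts.length < 4) {
            return Optional.empty();
        }
        try {
            Instant signedAt = Instant.parse(parts[2]);
            long validSeconds = Long.parseLong(parts[3]);
            return Optional.of(signedAt.plusSeconds(validSeconds));
        } catch (DateTimeParseException | NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String url = "https://example.com/0.json?authorization=bce-auth-v1%2FKEY"
                + "%2F2022-03-07T06%3A10%3A11Z%2F3600%2Fhost%2FSIG&x-bce-range=0-16174";
        Optional<Instant> expiry = expiresAt(url);
        System.out.println(expiry.map(Instant::toString).orElse("unknown"));
        System.out.println("expired: " + expiry.map(e -> e.isBefore(Instant.now())).orElse(false));
    }
}
```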

Java version

Maven dependencies

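<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.0.0</version>
</dependency>
<!-- Needed for the com.alibaba.fastjson imports used in the code below -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.83</version>
</dependency>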

Scraping code

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;

public class BaiduDocument {

    public static String getInfo(String url) {
        System.setProperty("webdriver.gecko.driver", "/usr/local/bin/geckodriver");
        FirefoxOptions options = new FirefoxOptions();
        // Run Firefox without a visible window
        options.addArguments("--headless");
        // Ignore certificate errors
        options.setAcceptInsecureCerts(true);
        // Key anti-bot setting: intended to make window.navigator.webdriver report false
        options.addPreference("dom.webdriver.enabled", false);
        // Override the User-Agent request header
        options.addPreference(
                "general.useragent.override",
                "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19"
        );
        FirefoxDriver driver = new FirefoxDriver(options);
        try {
            driver.get(url);
            // Firefox renders the plain-text response inside an HTML page; strip that HTML
            String text = driver.getPageSource()
                    .replace("<html><head><link rel=\"stylesheet\" href=\"resource://content-accessible/plaintext.css\"></head><body><pre>", "")
                    .replace("</pre></body></html>", "");
            // Strip the callback(...) wrapper to leave bare JSON
            String jsonStr = text.substring(text.indexOf('(') + 1, text.lastIndexOf(')'));
            // Parse the JSON
            JSONArray array = JSON.parseObject(jsonStr).getJSONArray("body");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < array.size(); i++) {
                String c = array.getJSONObject(i).getString("c");
                // A fragment that is a single space marks a line break, so emit a newline first
                if (" ".equals(c)) {
                    sb.append("\r\n");
                }
                sb.append(c);
            }
            return sb.toString();
        } finally {
            // quit() ends the whole browser session, not just the current window
            driver.quit();
        }
    }

}
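
The formatting rule in the loop above can be checked without a browser. The sketch below isolates it: each "c" value is appended in order, and a value that is exactly one space is preceded by a line break. FragmentJoiner is a hypothetical helper name, and the input list stands in for the "c" fields of the body array.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: join text fragments, treating a lone space as a line-break marker. */
final class FragmentJoiner {

    static String join(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String c : fragments) {
            // Same rule as the scraper: newline first, then the fragment itself
            if (" ".equals(c)) {
                sb.append("\r\n");
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("First line", " ", "Second line")));
    }
}
```

Note that the space itself is still appended after the line break, so each new line starts with one space, as it does in the original loop.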

Test code

import java.io.IOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class Test {
    public static void main(String[] args) throws IOException {
        Properties properties = new Properties();
        properties.load(Test.class.getClassLoader().getResourceAsStream("config.properties"));
        // Properties is unordered; a TreeMap keeps the entries sorted by key.
        // Keys compare as strings: "wk10" would sort before "wk2", so zero-pad beyond nine entries.
        Map<String, String> sorted = new TreeMap<>();
        for (String key : properties.stringPropertyNames()) {
            sorted.put(key, properties.getProperty(key));
        }
        StringBuilder sb = new StringBuilder();
        for (String url : sorted.values()) {
            sb.append(BaiduDocument.getInfo(url));
        }
        System.out.println(sb);
    }
}

Configuration file config.properties (it holds the request URLs)

Note: these JSON request URLs expire. If a request returns 403, refresh the page and copy the latest URLs.

wk1=https://wkbjcloudbos.bdimg.com/v1/docconvert4844/wk/7210afedd7358a5bd48649cf6dad0de5/0.json?responseContentType=application%2Fjavascript&responseCacheControl=max-age%3D3888000&responseExpires=Thu%2C%2021%20Apr%202022%2016%3A55%3A11%20%2B0800&authorization=bce-auth-v1%2Ffa1126e91489401fa7cc85045ce7179e%2F2022-03-07T08%3A55%3A11Z%2F3600%2Fhost%2F64757b750d644baa47f34171b6cb9ea3d2528906584e210ba1034af283917d28&x-bce-range=68120-86618&token=eyJ0eXAiOiJKSVQiLCJ2ZXIiOiIxLjAiLCJhbGciOiJIUzI1NiIsImV4cCI6MTY0NjY0NjkxMSwidXJpIjp0cnVlLCJwYXJhbXMiOlsicmVzcG9uc2VDb250ZW50VHlwZSIsInJlc3BvbnNlQ2FjaGVDb250cm9sIiwicmVzcG9uc2VFeHBpcmVzIiwieC1iY2UtcmFuZ2UiXX0%3D.K0FaDDVBUkf15ipbr1tCL9nqo3A6n81ZuXWl2dF0qnk%3D.1646646911

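3. Summary

Because Baidu Wenku loads and refreshes its data asynchronously, the remaining JSON requests do not appear right away. Click "Continue reading" (继续阅读) and scroll down, and the page issues the further JSON requests automatically.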

posted @ 2022-03-07 15:50  没有梦想的java菜鸟