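Building a crawler with Spring Boot + selenium-java
1. Introduction to Selenium
Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, and Opera. Its main uses are compatibility testing (checking that your application works well across different browsers and operating systems) and functional testing (building regression tests that verify software functionality and user requirements). It also supports recording actions and automatically generating test scripts in languages such as .NET, Java, and Perl.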
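2. selenium-java
selenium-java is the Java binding of Selenium. Depending on the driver, it can drive different browsers, for example selenium-chrome-driver, selenium-edge-driver, selenium-firefox-driver, selenium-ie-driver, selenium-opera-driver, and phantomjsdriver. I have used chromedriver and phantomjsdriver. Because it drives a real browser, it can fully simulate real user actions, and it is a good testing framework.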
3. chromedriver example
3.1 Download
The chromedriver versions and the Chrome versions each one supports:
Driver version | Supported Chrome versions |
---|---|
2.37 | v64-66 |
2.36 | v63-65 |
2.35 | v62-64 |
2.34 | v61-63 |
2.33 | v60-62 |
2.32 | v59-61 |
2.31 | v58-60 |
2.30 | v58-60 |
2.29 | v56-58 |
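The drivers can be downloaded from:
http://chromedriver.storage.googleapis.com/index.html
Note: 64-bit Windows can run 32-bit programs, so downloading the 32-bit driver is enough. I have tested this and it works.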
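Picking a driver from the table can be sketched in code. The class below is a hypothetical helper (the names `ChromeDriverVersions` and `driverFor` are made up for illustration); it only encodes the rows of the table above and returns the newest listed driver whose range covers a given Chrome major version.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: looks up a chromedriver release for a Chrome major version. */
final class ChromeDriverVersions {
    // Driver version -> {lowest, highest} supported Chrome major version,
    // copied from the table above, newest driver first.
    private static final Map<String, int[]> SUPPORTED = new LinkedHashMap<>();
    static {
        SUPPORTED.put("2.37", new int[] {64, 66});
        SUPPORTED.put("2.36", new int[] {63, 65});
        SUPPORTED.put("2.35", new int[] {62, 64});
        SUPPORTED.put("2.34", new int[] {61, 63});
        SUPPORTED.put("2.33", new int[] {60, 62});
        SUPPORTED.put("2.32", new int[] {59, 61});
        SUPPORTED.put("2.31", new int[] {58, 60});
        SUPPORTED.put("2.30", new int[] {58, 60});
        SUPPORTED.put("2.29", new int[] {56, 58});
    }

    /** Returns the newest listed driver that supports the given Chrome major version. */
    static Optional<String> driverFor(int chromeMajor) {
        for (Map.Entry<String, int[]> entry : SUPPORTED.entrySet()) {
            int[] range = entry.getValue();
            if (chromeMajor >= range[0] && chromeMajor <= range[1]) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty(); // Chrome version not covered by the table
    }

    public static void main(String[] args) {
        System.out.println(driverFor(65).orElse("no match in table"));
    }
}
```

Chrome versions outside the table yield an empty result, so the caller has to decide what to do in that case.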
3.2 Usage
ChromeOptions options = new ChromeOptions();
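// Hide the automation infobar and disable web security (same-origin) checks
options.addArguments("disable-infobars","disable-web-security");
// Run without a GUI. Leave this out during development so you can watch the browser
options.addArguments("--headless");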
String driverPath = "D:\\crawler-plugin\\chromedriver.exe";
System.setProperty("webdriver.chrome.driver", driverPath);
RemoteWebDriver driver= new ChromeDriver(options);
driver.get("http://www.baidu.com");
System.out.println(driver.findElement(By.tagName("body")).getText());
4. phantomjsdriver example
4.1 Download
Download page: http://phantomjs.org/download.html
4.2 Usage
String driverPath = "D:\\crawler-plugin\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe";
System.setProperty("phantomjs.binary.path", driverPath); // path to the PhantomJS executable
DesiredCapabilities desiredCapabilities = DesiredCapabilities.phantomjs();
// Set the user agent, both in the page settings and as a custom request header
desiredCapabilities.setCapability("phantomjs.page.settings.userAgent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
desiredCapabilities.setCapability("phantomjs.page.customHeaders.User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
RemoteWebDriver driver = new PhantomJSDriver(desiredCapabilities);
driver.get("http://www.baidu.com");
System.out.println(driver.findElement(By.tagName("body")).getText());
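5. Common problems when crawling pages
5.1 Captchas
Some sites require a login, and a captcha on the login form is a real headache. tess4j can recognize simple captchas; for complex ones, don't count on it.
Maven dependency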
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>3.4.0</version>
<exclusions>
<exclusion>
<groupId>com.sun.jna</groupId>
<artifactId>jna</artifactId>
</exclusion>
</exclusions>
</dependency>
tess4j configuration
##tess4j config
tess4j.language=chi_sim
tess4j.language.path=D:\\crawler-plugin\\tessdata
tess4j.data.path=D:\\crawler-plugin\\
Reading the configuration
@Configuration
public class Tess4jConfig {
    @Value("${tess4j.data.path}")
    @Setter
    @Getter
    private String tess4jDataPath;

    @Value("${tess4j.language.path}")
    @Setter
    @Getter
    private String tess4jLanguagePath;

    @Value("${tess4j.language}")
    @Setter
    @Getter
    private String tess4jLanguage;
}
Utility class
import com.cdchen.crawler.config.Tess4jConfig;
import lombok.extern.slf4j.Slf4j;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.util.LoggHelper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.File;

@Slf4j
public class Tess4jUtil {
    private static final Logger logger = LoggerFactory.getLogger(new LoggHelper().toString());
    static final double MINIMUM_DESKEW_THRESHOLD = 0.05d;
    private static ITesseract instance;
    // Defaults, replaced by the values from Tess4jConfig when that bean is available
    private static String datapath = "D:\\crawler-plugin\\";
    private static String testResourcesLanguagePath = "D:\\crawler-plugin\\tessdata";
    private static String language = "chi_sim";

    private static ITesseract getInstance(){
        Tess4jConfig config = SpringBeanUtil.getBean(Tess4jConfig.class);
        if(config != null){
            datapath = config.getTess4jDataPath();
            language = config.getTess4jLanguage();
            testResourcesLanguagePath = config.getTess4jLanguagePath();
        }
        if(datapath == null){
            log.error("tess4j.data.path must be set in the properties file, otherwise captchas cannot be recognized");
            return null;
        }
        if(testResourcesLanguagePath == null){
            log.error("tess4j.language.path must be set in the properties file, otherwise captchas cannot be recognized");
            return null;
        }
        if(language == null){
            log.error("tess4j.language must be set in the properties file, otherwise captchas cannot be recognized");
            return null;
        }
        if(instance == null){
            instance = new Tesseract();
            instance.setDatapath(new File(datapath).getPath());
            // Point Tesseract at the directory holding the language files;
            // this call replaces the data path set on the previous line
            instance.setDatapath(testResourcesLanguagePath);
            instance.setLanguage(language);
        }
        return instance;
    }

    public static String doOcr(File file) throws Exception{
        ITesseract tesseract = getInstance();
        if(tesseract == null){
            // getInstance() has already logged which property is missing
            throw new IllegalStateException("tess4j is not configured");
        }
        return tesseract.doOCR(file);
    }
}
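5.2 Pagination
Pagination is comparatively simple, and there are many ways to handle it. Two examples:
1 Work out the pattern of the page URLs and substitute the page number
2 Find the pagination button and simulate a click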
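The first approach can be sketched as follows. This is only an illustration under assumptions: the `PageUrls` class, the `{page}` placeholder, and the `https://example.com/list/page/{page}/` pattern are all hypothetical, and the real pattern has to be read off the target site.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds page URLs from a template containing "{page}". */
final class PageUrls {
    static final String PLACEHOLDER = "{page}";

    /** Substitutes the page number into the template. */
    static String forPage(String template, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        if (!template.contains(PLACEHOLDER)) {
            throw new IllegalArgumentException("template has no " + PLACEHOLDER);
        }
        return template.replace(PLACEHOLDER, String.valueOf(page));
    }

    /** Builds the URLs for pages first..last, both inclusive. */
    static List<String> range(String template, int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int page = first; page <= last; page++) {
            urls.add(forPage(template, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed URL pattern, for illustration only
        for (String url : range("https://example.com/list/page/{page}/", 1, 3)) {
            System.out.println(url);
        }
    }
}
```

Each generated URL would then be passed to `driver.get(url)` in turn.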
I used the second approach. The code below is the pagination code I wrote when crawling Qiushibaike, for reference:
private void jumpPageNum(int pageNum){
    // Only act when the page has a pagination bar
    if(WebElementUtils.doesWebElementExist(driver,By.className("pagination"))){
        WebElement pagination = driver.findElement(By.className("pagination"));
        String currentText = pagination.findElement(By.className("current")).getText();
        int currentPageNum = Integer.parseInt(currentText);
        // Keep clicking until the highlighted page is the requested one
        while (currentPageNum != pageNum){
            List<WebElement> pageNums = pagination.findElements(By.className("page-numbers"));
            for (int i = 0; i < pageNums.size(); i++) {
                String pageNumText = pageNums.get(i).getText();
                if(pageNumText.equals(pageNum+"")){
                    // The target page is shown in the bar: click it and scroll to the bottom
                    pageNums.get(i).click();
                    scrollBar.toPageEnd();
                    break;
                }else{
                    // Target not shown: click the last link to move the bar forward
                    if(i == (pageNums.size()-1)){
                        pageNums.get(i).click();
                    }
                }
            }
            // The page has changed, so look the pagination bar up again
            pagination = driver.findElement(By.className("pagination"));
            currentText = pagination.findElement(By.className("current")).getText();
            currentPageNum = Integer.parseInt(currentText);
        }
    }
}
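5.3 Scroll bar
The scroll bar is more troublesome, because the driver has no dedicated API for operating it (or perhaps I just did not find one). I took a roundabout route, and the same idea solves many similar problems: use JavaScript to operate the scroll bar.
The steps are:
- Append a script tag to the page body and put the JS functions you want to run into it
- Use the driver's JS execution engine to call those functions
Here is the utility class I wrote: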
import com.cdchen.crawler.util.SleepUtil;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.openqa.selenium.remote.RemoteWebDriver;

/**
 *
 * @description: tb
 *
 * @author: cdchen
 *
 * @create: 2019-04-30 17:08
 **/
@Data
@Slf4j
public class ScrollBar {
    RemoteWebDriver driver = null;

    // Current vertical scroll offset, read from body or documentElement
    private static String getScrollTopJs = "function getScrollTop(){"
            + " var scrollTop = 0, bodyScrollTop = 0, documentScrollTop = 0;"
            + " if(document.body){"
            + " bodyScrollTop = document.body.scrollTop;"
            + " }"
            + " if(document.documentElement){"
            + " documentScrollTop = document.documentElement.scrollTop;"
            + " }"
            + " scrollTop = (bodyScrollTop - documentScrollTop > 0) ? bodyScrollTop : documentScrollTop;"
            + " return scrollTop;"
            + "};";

    // Total height of the document
    private static String getScrollHeightJs = "function getScrollHeight(){"
            + " var scrollHeight = 0, bodyScrollHeight = 0, documentScrollHeight = 0;"
            + " if(document.body){"
            + " bodyScrollHeight = document.body.scrollHeight;"
            + " }"
            + " if(document.documentElement){"
            + " documentScrollHeight = document.documentElement.scrollHeight;"
            + " }"
            + " scrollHeight = (bodyScrollHeight - documentScrollHeight > 0) ? bodyScrollHeight : documentScrollHeight;"
            + " return scrollHeight;"
            + "};";

    // Height of the visible viewport
    private static String getWindowHeightJs = "function getWindowHeight(){"
            + " var windowHeight = 0;"
            + " if(document.compatMode == \"CSS1Compat\"){"
            + " windowHeight = document.documentElement.clientHeight;"
            + " }else{"
            + " windowHeight = document.body.clientHeight;"
            + " }"
            + " return windowHeight;"
            + "};";

    // True once the viewport has reached the bottom; >= tolerates fractional offsets
    private static String scroollIsOverJs = "function scroollIsOver(){"
            + " if(getScrollTop() + getWindowHeight() >= getScrollHeight()){"
            + " return true;"
            + " }else{"
            + " return false;"
            + " }"
            + "};";

    // Appends a script tag holding the four functions above to the page body
    private static String insertScriptJs = "var body = document.getElementsByTagName('body')[0];"
            + "var newScript = document.createElement('script');"
            + "newScript.type = 'text/javascript';"
            + "newScript.innerHTML = '"+getScrollTopJs+getScrollHeightJs+getWindowHeightJs+scroollIsOverJs+"';"
            + "body.appendChild(newScript);";

    public ScrollBar(RemoteWebDriver dr){
        driver = dr;
    }

    public void toPageEnd() {
        getDriver().executeScript(insertScriptJs);
        int start = 0;
        boolean scroollIsOver = false;
        while (!scroollIsOver){
            // window.scrollTo(x, y): keep x at 0 and move down 500px per step
            getDriver().executeScript("window.scrollTo(0,"+(start+500)+")");
            Boolean res = (Boolean)getDriver().executeScript("return scroollIsOver();");
            if(res != null && res){
                scroollIsOver = true;
            }
            start = start+500;
            // Give lazily loaded content time to appear
            SleepUtil.sleep(1000);
        }
    }
}
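As a variation on the loop above, here is a minimal sketch that checks the bottom condition in a single script call and stops after a maximum number of steps, so a page that keeps growing cannot loop forever. It is not the class above: `ScrollToEnd`, `AT_BOTTOM_JS`, and the `Function<String, Object>` parameter are assumptions made so the loop runs without a browser. With Selenium, the function could be `script -> driver.executeScript(script)`. The bottom check assumes a standards-mode page.

```java
import java.util.function.Function;

/** Hypothetical sketch: step-wise scrolling with an upper bound on the number of steps. */
final class ScrollToEnd {
    // Reports whether the viewport has reached the bottom (1px tolerance).
    static final String AT_BOTTOM_JS =
            "return window.innerHeight + window.pageYOffset"
            + " >= document.documentElement.scrollHeight - 1;";

    /**
     * Scrolls down stepPx at a time until the page reports the bottom.
     *
     * @param js          runs a script and returns its result
     * @param stepPx      pixels to move per step
     * @param maxSteps    give up after this many steps
     * @param pauseMillis wait after each step so lazy content can load
     * @return true if the bottom was reached, false if maxSteps ran out
     */
    static boolean scroll(Function<String, Object> js, int stepPx, int maxSteps, long pauseMillis) {
        for (int step = 1; step <= maxSteps; step++) {
            js.apply("window.scrollTo(0, " + (step * stepPx) + ");");
            if (Boolean.TRUE.equals(js.apply(AT_BOTTOM_JS))) {
                return true;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and stop
                return false;
            }
        }
        return false;
    }
}
```

Passing the script runner in as a function keeps the loop independent of the driver type, which also makes it easy to exercise with a fake page.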
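5.4 Elements inside an iframe cannot be found
Sometimes a page embeds an iframe, and elements inside it cannot be located from the top-level document. In that case the driver has to be switched into the iframe first, as in the following code: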
WebElement iframe = driver.findElement(By.tagName("iframe"));
driver.switchTo().frame(iframe);
// Now locate elements or do other work inside the iframe; switch back when done
driver.switchTo().parentFrame();
5.5 Switching tabs
How do you work with several open tabs when using chromedriver?
List<String> tabs = new ArrayList<String>(driver.getWindowHandles());
// Switch to the first tab
driver.switchTo().window(tabs.get(0));
// Switch to the tab at index n
driver.switchTo().window(tabs.get(n));
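Since `driver.getWindowHandles()` returns a `Set<String>`, the handle lookup can be separated from the driver and checked on its own. The sketch below is a hypothetical helper (`TabHandles` and its method names are made up): one method picks a handle by position with a bounds check, the other finds the handle of a tab that was opened between two snapshots. As far as I know, WebDriver does not promise that handles come back in tab order, so comparing snapshots is the safer way to find a new tab.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for working with the set returned by getWindowHandles(). */
final class TabHandles {
    /** Returns the handle at position n (0-based) in the set's iteration order. */
    static String handleAt(Set<String> handles, int n) {
        List<String> ordered = new ArrayList<>(handles);
        if (n < 0 || n >= ordered.size()) {
            throw new IndexOutOfBoundsException(
                    "no tab at index " + n + ", open tabs: " + ordered.size());
        }
        return ordered.get(n);
    }

    /** Returns a handle that is in 'after' but not in 'before', or null if there is none. */
    static String newlyOpened(Set<String> before, Set<String> after) {
        Set<String> added = new LinkedHashSet<>(after);
        added.removeAll(before);
        return added.isEmpty() ? null : added.iterator().next();
    }
}
```

In use, `before` would be captured with `driver.getWindowHandles()` ahead of the click that opens the tab and `after` once it has opened; the result then goes to `driver.switchTo().window(...)`.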