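HtmlUnit Basics 01

Overview

HtmlUnit is a library for fetching and interacting with web pages from back-end code on the JVM.
API: https://htmlunit.sourceforge.io/apidocs/index.html

Gradle dependencies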

// Fetch web pages
// https://mvnrepository.com/artifact/net.sourceforge.htmlunit/htmlunit
// Note: the `compile` configuration was removed in Gradle 7; `implementation` replaces it.
implementation group: 'net.sourceforge.htmlunit', name: 'htmlunit', version: '2.44.0'
// Parse web pages
// https://mvnrepository.com/artifact/org.jsoup/jsoup
implementation group: 'org.jsoup', name: 'jsoup', version: '1.13.1'
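
For context, the two dependency lines above belong inside a `dependencies` block. A minimal `build.gradle` sketch could look like the following; the `groovy` plugin, `mavenCentral()` and `localGroovy()` are assumptions about the project setup, not something the original post specifies.

```groovy
// build.gradle: a minimal sketch, assuming a Groovy project built with Gradle
plugins {
    id 'groovy'
}

repositories {
    // Both artifacts are published to Maven Central
    mavenCentral()
}

dependencies {
    // Use the Groovy version bundled with Gradle (an assumption; pin a version if needed)
    implementation localGroovy()
    // Fetch web pages
    implementation 'net.sourceforge.htmlunit:htmlunit:2.44.0'
    // Parse web pages
    implementation 'org.jsoup:jsoup:1.13.1'
}
```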

Example

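import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.html.DomElement
import com.gargoylesoftware.htmlunit.html.HtmlPage

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_68)
// Enable the JavaScript engine (true is the default)
webClient.getOptions().setJavaScriptEnabled(true)
// Disable CSS support
webClient.getOptions().setCssEnabled(false)
// Do not throw an exception when a script on the page fails
webClient.getOptions().setThrowExceptionOnScriptError(false)
// Do not throw an exception on failing HTTP status codes (e.g. 404, 500)
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false)
// Timeout in milliseconds, applied to both connecting and reading data
webClient.getOptions().setTimeout(10 * 1000)

HtmlPage htmlPage = webClient.getPage("https://news.sina.com.cn/roll/")
// Give background JavaScript up to 10 seconds to finish before reading the page
webClient.waitForBackgroundJavaScript(10 * 1000)
// Get the page as plain text
String text = htmlPage.asText()
// Get the page as markup
String html = htmlPage.asXml()
// Look up an element by its name attribute (not its tag name);
// throws ElementNotFoundException if no element has name="span"
DomElement spanDom = htmlPage.getElementByName("span")
// Read the element's text content
String content = spanDom.getTextContent()
// Simulate a click on the element
spanDom.click()
// Release the client's resources when done
webClient.close()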
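
The dependency list includes jsoup for parsing, but the example above does not use it. The following is a minimal sketch of how the two libraries might be combined: HtmlUnit renders the page (including JavaScript) and jsoup queries the resulting markup. The `a[href]` selector and the printed output format are assumptions chosen for illustration, and what the page actually returns depends on the live site.

```groovy
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.html.HtmlPage
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_68)
try {
    webClient.getOptions().setCssEnabled(false)
    webClient.getOptions().setThrowExceptionOnScriptError(false)
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false)
    webClient.getOptions().setTimeout(10 * 1000)

    String url = "https://news.sina.com.cn/roll/"
    HtmlPage htmlPage = webClient.getPage(url)
    // Let background JavaScript run for up to 10 seconds
    webClient.waitForBackgroundJavaScript(10 * 1000)

    // Hand the rendered markup to jsoup; passing the URL as the base URI
    // lets "abs:href" resolve relative links to absolute ones
    Document doc = Jsoup.parse(htmlPage.asXml(), url)

    // Assumed selector: every anchor that carries an href attribute
    for (Element link : doc.select("a[href]")) {
        println "${link.text()} -> ${link.attr('abs:href')}"
    }
} finally {
    // Always close the client, even if fetching or parsing fails
    webClient.close()
}
```

Wrapping the work in `try`/`finally` keeps the client from leaking when the request or the parsing step throws.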

Posted 2020-10-28 15:06 by duchaoqun
