[UiBot Tutorial] [JS Crawler Plugin] A JS data-extraction plugin built on the browser's RunJS to simplify extraction steps

Overview:

This plugin is a pure JavaScript script injected into the browser page via WebBrowser.RunJS. The script creates a crawler object, which then supports chained extraction over objects, events, JSON, elements, nodes, regular expressions, and strings. The script can also be run directly in the browser console.
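A minimal sketch of the idea, runnable in the Chrome console once the crawler script has been pasted in (the selector is borrowed from Demo1 below; treat it as illustrative):

// Create a crawler (defaults to document.body), select nodes with a CSS
// selector, read an attribute from each, and turn the result into plain data.
var titles = new crawler().$("div.media-body>div>a", "innerText").get();
console.log(titles);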

 

Updated 2020-03-04:

1. Added the forEach method, which exposes intermediate results for custom handling and makes multi-value structure extraction easier.

2. Added the more method, which extracts multiple sibling nodes, even across levels, in a single call.
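Rough sketches of both additions (the selectors are illustrative; full examples appear under "More Examples" below):

// forEach: iterate the intermediate results like Array.prototype.forEach,
// replacing each matched <img> node with its src attribute before get().
new crawler().$("li a").$x("img").forEach(function(node, i, arr){ arr[i] = node.src }).get()

// more: extract several sibling values per matched row in one call; a JSON
// argument yields an array of objects keyed by the given names.
new crawler().$("tr").more({"name":"//td[1]/a/text()","link":"//td[1]/a/@href"}, "xpath")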

 

API Reference:

/*
JS crawler wrapper: inject the script via RunJS and execute it to retrieve results.
1. Currently only Chrome is supported.
2. If cross-origin access is enabled in Chrome, expressions containing an iframe will cross into the iframe automatically.
3. Supports chained extraction with XPath, CSS selectors, and JSONPath.
4. Supports chained processing of string results.

# Create a crawler object; an element object (or array) to extract from can be passed in. Defaults to document.body.
# ele: array of elements
new crawler(ele);

# XPath extraction; extraction is only supported under document. Even subsequent links in the chain still extract from document.
# query: XPath expression [required]
$x(query)

# CSS-selector extraction. Supports extracting HTML attributes, methods, and events, and supports repeated chained list-template extraction.
# query: CSS selector expression [required]. param: attribute, method, or event
$(query, param)

# JSON data extraction.
# query: JSONPath expression [required]. param: PATH (JSON path) or VALUE (value). Extracts values by default.
$j(query, param)

# Regex filtering of strings. Excludes or keeps data according to the filter condition.
# query: regular expression or string [required]
filter(query)

# Regex extraction from strings.
# query: regular expression or string [required]. index: number or array of numbers; filters the results by index
regex(query, index)

# String replacement.
# substr: string to be replaced, as a regular expression or string [required]. replacement: replacement string
replace(substr, replacement)

# String splitting.
# query: regular expression or string [required]
split(query)

# Chained string-format extraction: converts a chained expression string into an executable expression.
# expression: extraction expression [required]
mix(expression)

# Exposes the extraction results for custom processing. Works like Array.forEach, iterating over the crawler's ele results.
# func: JavaScript function
forEach(func)

# Extracts multiple sibling results at once, returning an array of objects or a two-dimensional array.
# exps: array or JSON whose values are crawler expressions
# types: string (mix|xpath|css|json|regex|replace|split|filter)
more(exps, types)

# Gets the extraction results, converting the ele objects into displayable data.
get()

# Gets the nearest unique XPath or CSS-selector expression for an element.
# elm: element to locate [required]. xp: XPath (true) or CSS selector (false)
getSelector(elm, xp)

# Gets the full-path XPath or CSS-selector expression for an element.
# elm: element to locate [required]. xp: XPath (true) or CSS selector (false)
getFullSelector(elm, xp)
*/
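To make the chaining concrete, here is a hedged console sketch combining node extraction with the string-processing steps (the page structure and patterns are illustrative; keep/drop semantics of filter follow the description above):

// Extract link texts, keep only those matching a pattern, pull out the
// digits, and clean them up -- each string step applies to every result.
new crawler()
    .$x("//div[@class='media-body']//a/text()")  // text nodes, document-wide
    .filter(/2020/)                              // keep/drop results by regex
    .regex(/\d+/, 0)                             // take the match at index 0
    .replace(/^0+/, "")                          // strip leading zeros
    .get()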

 

Using the Plugin:

The plugin depends on the Chrome browser. Import the attached task into your project; importing it in the source-code view is all that's needed.

Import spider

 

Demo1:

Import spider
dim hWeb = ""
hWeb = WebBrowser.Create("chrome","https://forum.uibot.com.cn/",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
data = spider.xpath(hWeb,"//div[@class='media-body']//a/text()")
TracePrint(data)
data = spider.xpath(hWeb,"//div[@class='media-body']//a/@href")
TracePrint(data)
data = spider.css(hWeb,"div.media-body>div>a","innerText")
TracePrint(data)
data = spider.css(hWeb,"div.media-body>div>a","href")
TracePrint(data)
data = spider.mix(hWeb,'''$(".card-body li .media-body").more(["//a[1]/text()","//a[1]/@href","//div[1]/span[1]/text()","//div[1]/span[2]/text()"],"xpath")''')
TracePrint(data)
data = spider.mix(hWeb,'''$(".card-body li .media-body").more({"标题":"//a[1]/text()","地址":"//a[1]/@href","作者":"//div[1]/span[1]/text()","时间":"//div[1]/span[2]/text()"},"xpath")''')
TracePrint(data)

Demo2:

dim hWeb = ""
hWeb = WebBrowser.Create("chrome","http://9pk.5566rs.com/",3600000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
// hWeb = WebBrowser.BindBrowser("chrome",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200})

data1=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").more(["//td[1]/a/text()","//td[1]/a/@href","//td[3]/text()"],"xpath")''')

data = ""
For Each v In data1
data = data & Join(v,",") & "\n"
Next
dim dicts = {"":"a",\
"":"b",\
"":"c",\
"":"d",\
"":"e",\
"":"f",\
"":"g",\
"":"h",\
"":"i",\
"":"j",\
"":"k",\
"":"l",\
"":"m",\
"":"n",\
"":"o",\
"":"p",\
"":"q",\
"":"r",\
"":"s",\
"":"t",\
"":"u",\
"":"v",\
"":"w",\
"":"x",\
"":"y",\
"":"z",\
"":"A",\
"":"B",\
"":"C",\
"":"D",\
"":"E",\
"":"F",\
"":"G",\
"":"H",\
"":"I",\
"":"J",\
"":"K",\
"":"L",\
"":"M",\
"":"N",\
"":"O",\
"":"P",\
"":"Q",\
"":"R",\
"":"S",\
"":"T",\
"":"U",\
"":"V",\
"":"W",\
"":"X",\
"":"Y",\
"":"Z",\
"":"1",\
"":"2",\
"":"3",\
"":"4",\
"":"5",\
"":"6",\
"":"7",\
"":"8",\
"":"9",\
"":"0"}
For Each k,v In dicts
data = Replace(data,k,v)
Next
File.Write("d:\\aaa1.csv",data,"gbk")
 

Demo4:

Import spider

dim hWeb = ""
hWeb = WebBrowser.Create("chrome","http://9pk.5566rs.com/",3600000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
// hWeb = WebBrowser.BindBrowser("chrome",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200})

data1=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[1]/a/text()")''')
data2=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[1]/a/@href")''')
data3=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[3]/text()")''')

data = ""
For Each k,v In data1
res = []
push(res,v)
push(res,data2[k])
push(res,data3[k])
data = data & Join(res,",") & "\n"
Next
dim dicts = {"":"a",\
"":"b",\
"":"c",\
"":"d",\
"":"e",\
"":"f",\
"":"g",\
"":"h",\
"":"i",\
"":"j",\
"":"k",\
"":"l",\
"":"m",\
"":"n",\
"":"o",\
"":"p",\
"":"q",\
"":"r",\
"":"s",\
"":"t",\
"":"u",\
"":"v",\
"":"w",\
"":"x",\
"":"y",\
"":"z",\
"":"A",\
"":"B",\
"":"C",\
"":"D",\
"":"E",\
"":"F",\
"":"G",\
"":"H",\
"":"I",\
"":"J",\
"":"K",\
"":"L",\
"":"M",\
"":"N",\
"":"O",\
"":"P",\
"":"Q",\
"":"R",\
"":"S",\
"":"T",\
"":"U",\
"":"V",\
"":"W",\
"":"X",\
"":"Y",\
"":"Z",\
"":"1",\
"":"2",\
"":"3",\
"":"4",\
"":"5",\
"":"6",\
"":"7",\
"":"8",\
"":"9",\
"":"0"}
For Each k,v In dicts
data = Replace(data,k,v)
Next
File.Write("d:\\aaa.csv",data,"gbk")
 

More Examples:

1. Chained extraction:

new crawler().$(".subject.break-all").$x("//a").filter(/857/ig).replace(/(t+)/,'hahaha$1').get()

 

2. Event triggering:

new crawler().$(".subject.break-all","click()")

 

3. Using a browser object:

new crawler(document.querySelectorAll(".subject.break-all")).$("a").get()

 

4. Repeated blocks, sub-template extraction:

{"title":new crawler(document.querySelectorAll(".subject.break-all")).$("a").get(),

"content":new crawler(document.querySelectorAll(".subject.break-all")).$("a").get()

}

 

5. Multi-node extraction (advanced):

new crawler().$("li a").filter("tabindex").$x("img").forEach(function(a,b,c){c[b]=a.src}).get()

 

6. Quick multi-node extraction (advanced):

data1=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").more(["//td[1]/a/text()","//td[1]/a/@href","//td[3]/text()"],"xpath")''')

 

There are more easter eggs for you to discover on your own.

PS: This script is a personal labor of love; once you learn to use it, you'll find it extremely practical.

 

Download the plugin from the original thread: https://forum.uibot.com.cn/thread-869.htm
