Simulating Ajax for a Web Crawler: HtmlUnit
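I have recently been using Jsoup to scrape data from a website, but some of its pages are generated dynamically by Ajax requests. I asked in a chat group, and an expert there said that simulating the Ajax requests would be enough. A search online turned up the article below, so I am borrowing it to try out.
The reposted article follows: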
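There are many ways to implement a web crawler described online, but most of them do not support Ajax. As Brother Li puts it: simulation is the way to go. Indeed, if you can simulate a browser that has no user interface, what is left that you cannot do? There are quite a few frameworks for parsing Ajax sites; I chose HtmlUnit (official site: http://htmlunit.sourceforge.net/). HtmlUnit is essentially a headless browser written in Java. It can do almost anything a browser can, and much of that is wrapped very cleanly. This is what I worked out over the past few days, so I am recording it here.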
package com.lanyotech.www.wordbank;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlOption;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;

public class WorldBankCrawl {

    private static String TARGET_URL = "http://databank.worldbank.org/ddp/home.do";

    public static void main(String[] args)
            throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        // Simulate a browser
        WebClient webClient = new WebClient();
        // Configure the WebClient. These setters belong to the older HtmlUnit API
        // used in this post; newer releases expose most of them via webClient.getOptions().
        webClient.setJavaScriptEnabled(true);
        webClient.setCssEnabled(false);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.setTimeout(35000);
        webClient.setThrowExceptionOnScriptError(false);

        // Open the target URL in the simulated browser
        HtmlPage rootPage = webClient.getPage(TARGET_URL);
        // Get the database select box
        HtmlSelect hs = (HtmlSelect) rootPage.getElementById("lstCubes");
        // Select the first database, as required
        hs.getOption(0).setSelected(true);

        // Simulate clicking the Next button to move to the second page
        System.out.println("Navigating...");
        // Run the JS event that the button triggers
        ScriptResult sr = rootPage.executeJavaScript("javascript:setCubeData(2,-1,4,'/ddp');");
        // Second page: choose countries
        HtmlPage countrySelect = (HtmlPage) sr.getNewPage();
        // Get the frame page holding the select box with all countries
        HtmlPage framePage = (HtmlPage) countrySelect.getFrameByName("frmTree1").getEnclosedPage();
        // Trigger the JS event behind the "select all" button
        framePage.executeJavaScript("javascript:TransferListAll('countrylst','countrylstselected','no');SetSelectedCount('countrylstselected','tdcount');");
        // Trigger the JS event behind the Next button
        ScriptResult electricityScriptResult = framePage.executeJavaScript("javascript:wrapperSetCube('/ddp')");
        System.out.println("Navigating...");

        // Next page: electricitySelect
        HtmlPage electricitySelect = (HtmlPage) electricityScriptResult.getNewPage();
        // Get the iframe that holds the electricity series selection
        HtmlPage electricityFrame = (HtmlPage) electricitySelect.getFrameByName("frmTree1").getEnclosedPage();
        // Get the select box
        HtmlSelect seriesSelect = (HtmlSelect) electricityFrame.getElementById("countrylst");
        // Get all of its options
        List<HtmlOption> optionList = seriesSelect.getOptions();
        // Select the desired option
        optionList.get(1).setSelected(true);
        // Simulate clicking the select button
        electricityFrame.executeJavaScript("javascript:TransferList('countrylst','countrylstselected','no');SetSelectedCount('countrylstselected','tdcount');");
        // Get the lower select box that now holds the chosen items
        // (not used further; handy for checking the selection)
        HtmlSelect electricitySelected = (HtmlSelect) electricityFrame.getElementById("countrylstselected");
        List<HtmlOption> list = electricitySelected.getOptions();
        // Simulate clicking Next to move to the time selection page
        ScriptResult timeScriptResult = electricityFrame.executeJavaScript("javascript:wrapperSetCube('/ddp')");
        System.out.println("Navigating...");

        HtmlPage timeSelectPage = (HtmlPage) timeScriptResult.getNewPage();
        // Get the frame with the time select box
        timeSelectPage = (HtmlPage) timeSelectPage.getFrameByName("frmTree1").getEnclosedPage();
        // Select all time periods
        timeSelectPage.executeJavaScript("javascript:TransferListAll('countrylst','countrylstselected','no');SetSelectedCount('countrylstselected','tdcount');");
        // Click the Next button
        ScriptResult exportResult = timeSelectPage.executeJavaScript("javascript:wrapperSetCube('/ddp')");
        System.out.println("Navigating...");

        // Move to the export page
        HtmlPage exportPage = (HtmlPage) exportResult.getNewPage();
        // Click the Export button on the page to reach the download page
        ScriptResult downResult = exportPage.executeJavaScript("javascript:exportData('/ddp' ,'EXT_BULK' ,'WDI_Time=51||WDI_Series=1||WDI_Ctry=244||' );");
        System.out.println("Navigating...");

        HtmlPage downLoadPage = (HtmlPage) downResult.getNewPage();
        // Click the Excel icon to start the download
        ScriptResult downLoadResult = downLoadPage.executeJavaScript("javascript:exportData('/ddp','BULKEXCEL');");
        // Save the Excel file; try-with-resources closes both streams
        try (InputStream is = downLoadResult.getNewPage().getWebResponse().getContentAsStream();
             OutputStream fos = new FileOutputStream("d://test.xls")) {
            byte[] buffer = new byte[1024 * 30];
            int len;
            // read() returns -1 at end of stream
            while ((len = is.read(buffer)) != -1) {
                fos.write(buffer, 0, len);
            }
        }
        System.out.println("Success!");
    }
}
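The last step of the listing, copying the response stream into a file, can be tried without HtmlUnit or network access. The following is only a minimal sketch of that one step using the JDK alone. The class name `StreamCopyDemo`, the `copy` helper, and the in-memory byte array standing in for the HtmlUnit response stream are assumptions made for illustration and are not part of the original post.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class StreamCopyDemo {

    // Copies every byte from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[30 * 1024];
        long total = 0;
        int len;
        // read() returns -1 once the end of the stream is reached
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
            total += len;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the stream that the crawler obtains from
        // downLoadResult.getNewPage().getWebResponse().getContentAsStream()
        byte[] fakeResponse = new byte[100_000];
        Path target = Files.createTempFile("export", ".xls");

        // try-with-resources closes both streams even if the copy fails
        try (InputStream in = new ByteArrayInputStream(fakeResponse);
             OutputStream out = Files.newOutputStream(target)) {
            long copied = copy(in, out);
            System.out.println("Copied " + copied + " bytes to " + target);
        }
    }
}
```

In the crawler itself, the same `copy` call would take the HtmlUnit response stream and a `FileOutputStream` for the target path. Testing against `-1` rather than `> 0` follows the `InputStream.read` contract, where `-1` signals end of stream.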
Note:
/** Requesting a web page with HtmlUnit */
WebClient wc = new WebClient();
wc.getOptions().setJavaScriptEnabled(true);             // enable the JS interpreter (default is true)
wc.getOptions().setCssEnabled(false);                   // disable CSS support
wc.getOptions().setThrowExceptionOnScriptError(false);  // whether to throw an exception when a script fails
wc.getOptions().setTimeout(10000);                      // connection timeout, 10 s here; 0 means wait indefinitely
HtmlPage page = wc.getPage("http://cq.qq.com/baoliao/detail.htm?294064");
String pageXml = page.asXml();                          // get the response text as XML