A Simple WebMagic Walkthrough: Crawling Sports Lottery Draw Results
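There are many Java crawler frameworks, and WebMagic is one of them. Its documentation is thorough and it is easy to pick up, which makes it a good fit for scraping small data sets for personal use. Below, crawling lottery draw results serves as the example for introducing the basics.
The official WebMagic documentation is at http://webmagic.io/docs/zh/. It is detailed and walks through a complete crawl by example, including persisting the crawled results.
WebMagic is well encapsulated: in most cases we only need to define our own PageProcessor (to extract data) and Pipeline (to handle the extracted data, for example by persisting it).
Following that pattern, let's crawl lottery draw results. The content below is for personal study only.
Requirement: crawl the lottery draw results and write them to a database.
We build on the Spring Boot framework, which makes scheduled crawling easy, and combine it with MyBatis to write the data to the database.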
---------------------------- divider ---------------------------
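Sources to crawl: the official Sports Lottery site (https://www.lottery.gov.cn/), 500 Lottery (https://kaijiang.500.com/), and Sina Aicai (https://kaijiang.aicai.com/).
The Sports Lottery games are: 大乐透, 7星彩, 排列3, and 排列5.
Open the official Sports Lottery site; the home page shows the latest draw results for every game.
Clicking one of the games opens its detail page. Why click through? For a beginner, handling each game separately is probably simpler and clearer.
Open the 大乐透 detail page (https://www.lottery.gov.cn/dlt/index.html), press F12 in Chrome to open the developer tools, refresh the page, and watch the requests that are made.
Going through the requests one by one, we find this one (its URL is the DLT_URL constant in the code below).
This request returns JSON, and the data matches our needs exactly, so we can use it directly.
You might wonder why I looked at the requests first instead of analyzing the page content. In fact, before finding this request I did analyze the page: its source contains no draw data, so I concluded that the data is filled into the page after it loads, which naturally led me to inspect the requests.
JSON is the best case: once deserialized it can be used directly, with no need to extract data from HTML. The core code is as follows:
Define the TcOrgProcessor class, which describes how to extract the data we need.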
// Assumption: Alibaba fastjson, which matches the JSON/JSONObject calls used below
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TcOrgProcessor implements PageProcessor {
    private final Logger logger = LoggerFactory.getLogger(TcOrgProcessor.class);
    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    public static final String DLT_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=85,0&isVerify=1";
    public static final String QXC_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=04,0&isVerify=1";
    public static final String PL5_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=35,0;350133,0&isVerify=1";
    private final Site site = Site.me();

    @Override
    public void process(Page page) {
        String url = page.getUrl().toString();
        // Pre-check the response: a non-zero errorCode means the request failed
        String text = page.getRawText();
        JSONObject rootInfo = JSON.parseObject(text);
        if (rootInfo == null || rootInfo.getIntValue("errorCode") != 0) {
            page.setSkip(true);
            logger.error("Bad response, URL=>{}, body=>{}", url, text);
            return;
        }
        // Read the "value" field; skip the page if it is missing
        JSONObject valueObject = rootInfo.getJSONObject("value");
        if (valueObject == null) {
            page.setSkip(true);
            logger.error("Response has no value field, URL=>{}, body=>{}", url, text);
            return;
        }
        // 大乐透, 7星彩 and the 排列 games share one result model; only the key differs
        try {
            String[] keys = new String[] { "dlt", "qxc", "plw", "pls" };
            List<DrawInfo> drawInfos = new ArrayList<>();
            for (String key : keys) {
                JSONObject drawInfoObject = valueObject.getJSONObject(key);
                if (drawInfoObject == null || drawInfoObject.isEmpty()) continue;
                // Read the raw fields
                String gameName = drawInfoObject.getString("lotteryGameName");
                String drawNum = drawInfoObject.getString("lotteryDrawNum");
                String strDrawResult = drawInfoObject.getString("lotteryDrawResult").replace(" ", ",");
                LocalDate drawDate = LocalDate.parse(drawInfoObject.getString("lotteryDrawTime"), DATE_FORMAT);
                String poolBalance = drawInfoObject.getString("poolBalanceAfterdraw").replace(",", "");
                // Build the draw-info model
                List<Integer> drawResult = Arrays.stream(strDrawResult.split(","))
                        .map(Integer::valueOf)
                        .collect(Collectors.toList());
                int poolIntValue = new BigDecimal(poolBalance).intValue();
                DrawInfo drawInfo = new DrawInfo(gameName, drawNum, drawDate, drawResult, poolIntValue, Source.TC_ORG);
                drawInfos.add(drawInfo);
            }
            // Store the results for the Pipeline
            page.putField("results", drawInfos);
        } catch (Exception e) {
            logger.error("Parse error, URL=>{}", url, e);
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new TcOrgProcessor())
                .addUrl(DLT_URL)
                .addUrl(QXC_URL)
                .addUrl(PL5_URL)
                .addPipeline(new ConsolePipeline())
                .run();
    }
}
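The field normalization inside process() can be looked at in isolation. The following is a minimal sketch using only the JDK; DrawFieldParser is a hypothetical helper that does not appear in the original code, and the sample inputs are made up to match the formats the processor seems to expect (space-separated numbers, a pool balance with thousands separators). It returns a long for the pool balance on the assumption that a jackpot could exceed the int range; the original code uses int.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of WebMagic or the original code.
final class DrawFieldParser {

    private DrawFieldParser() {
    }

    // Splits on any run of whitespace, commas, or '|' so "01 02 03" and "1,2|3" both work.
    static List<Integer> parseNumbers(String raw) {
        return Arrays.stream(raw.trim().split("[\\s,|]+"))
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    // Removes thousands separators, then drops any fractional part.
    static long parsePoolBalance(String raw) {
        return new BigDecimal(raw.replace(",", "").trim()).longValue();
    }

    public static void main(String[] args) {
        System.out.println(parseNumbers("01 02 03 04 05"));     // prints [1, 2, 3, 4, 5]
        System.out.println(parsePoolBalance("188,827,520.00")); // prints 188827520
    }
}
```

Keeping this logic in one place would also let the other processors share it instead of repeating the stream pipeline.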
The DrawInfo class is the draw-information model. We normalize the extracted data into this model so that it is easy to use in the Pipeline later.
import java.time.LocalDate;
import java.util.List;
import java.util.Objects;

/**
 * Draw-information model
 */
public class DrawInfo {
    // Game name [大乐透, 7星彩, 排列5]
    private String game;
    // Draw number [21035]
    private String expect;
    // Draw date [2021-03-15]
    private LocalDate drawDate;
    // Draw result [1,2,3,4,5]
    private List<Integer> drawResult;
    // Prize pool [188827520]
    private int poolBalance;
    // Source the data was collected from
    private Source source;

    public DrawInfo(String game, String expect, LocalDate drawDate, List<Integer> drawResult, int poolBalance, Source source) {
        this.game = game;
        this.expect = expect;
        this.drawDate = drawDate;
        this.drawResult = drawResult;
        this.poolBalance = poolBalance;
        this.source = source;
    }

    public String getGame() {
        return game;
    }

    public void setGame(String game) {
        this.game = game;
    }

    public String getExpect() {
        return expect;
    }

    public void setExpect(String expect) {
        this.expect = expect;
    }

    public LocalDate getDrawDate() {
        return drawDate;
    }

    public void setDrawDate(LocalDate drawDate) {
        this.drawDate = drawDate;
    }

    public List<Integer> getDrawResult() {
        return drawResult;
    }

    public void setDrawResult(List<Integer> drawResult) {
        this.drawResult = drawResult;
    }

    public int getPoolBalance() {
        return poolBalance;
    }

    public void setPoolBalance(int poolBalance) {
        this.poolBalance = poolBalance;
    }

    public Source getSource() {
        return source;
    }

    public void setSource(Source source) {
        this.source = source;
    }

    @Override
    public String toString() {
        return "DrawInfo{" +
                "game='" + game + '\'' +
                ", expect='" + expect + '\'' +
                ", drawDate=" + drawDate +
                ", drawResult=" + drawResult +
                ", poolBalance=" + poolBalance +
                ", source=" + source +
                '}';
    }

    // Equality deliberately ignores poolBalance and source:
    // the same draw reported by two sources compares as equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        DrawInfo drawInfo = (DrawInfo) o;
        return Objects.equals(game, drawInfo.game)
                && Objects.equals(expect, drawInfo.expect)
                && Objects.equals(drawDate, drawInfo.drawDate)
                && Objects.equals(drawResult, drawInfo.drawResult);
    }

    @Override
    public int hashCode() {
        return Objects.hash(game, expect, drawDate, drawResult);
    }
}
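The code refers to Source.TC_ORG and Source.WUBAI_COM, but the Source type itself is not shown. One plausible shape is a plain enum with one constant per site, sketched below. Everything beyond the two constant names used in the code is an assumption: the AICAI_COM constant, the homepage field, and its getter are hypothetical.

```java
// Hypothetical definition: only the names TC_ORG and WUBAI_COM come from the original code.
enum Source {
    TC_ORG("https://www.lottery.gov.cn/"),
    WUBAI_COM("https://kaijiang.500.com/"),
    AICAI_COM("https://kaijiang.aicai.com/"); // assumed name for the third source

    private final String homepage;

    Source(String homepage) {
        this.homepage = homepage;
    }

    // Home page of the site the data was collected from.
    String getHomepage() {
        return homepage;
    }
}
```

An enum keeps the set of sources closed and makes the value easy to store as a string column through name().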
Now that we have crawled the data we need, a custom Pipeline lets us process the crawled results ourselves.
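import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.List;
import java.util.Map;

@Component
public class DrawResultPipeline implements Pipeline {
    private final Logger logger = LoggerFactory.getLogger(DrawResultPipeline.class);

    /**
     * Handles the data extracted by a PageProcessor.
     *
     * @param resultItems fields the PageProcessor stored with page.putField
     * @param task        the crawl task (Spider) that produced them
     */
    @Override
    public synchronized void process(ResultItems resultItems, Task task) {
        Map<String, Object> map = resultItems.getAll();
        logger.info("Crawl result: {}", map);
        //noinspection unchecked
        List<DrawInfo> results = (List<DrawInfo>) map.get("results");
        //TODO: persist to the database
    }
}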
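Because several sources report the same draws, the persistence step will probably want to drop duplicates before writing. DrawInfo's equals and hashCode ignore the source, so a hash-based set is enough for that. The sketch below shows the idea with the JDK only; Draw is a trimmed-down, hypothetical stand-in for DrawInfo, DrawDeduplicator is likewise hypothetical, and the sample values are placeholders. It assumes every source uses the same game name for the same game, which may not hold in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

// Hypothetical, trimmed-down stand-in for DrawInfo.
final class Draw {
    final String game;
    final String expect;
    final List<Integer> numbers;
    final String source;

    Draw(String game, String expect, List<Integer> numbers, String source) {
        this.game = game;
        this.expect = expect;
        this.numbers = numbers;
        this.source = source;
    }

    // As in DrawInfo, equality ignores the source.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Draw)) return false;
        Draw other = (Draw) o;
        return game.equals(other.game) && expect.equals(other.expect) && numbers.equals(other.numbers);
    }

    @Override
    public int hashCode() {
        return Objects.hash(game, expect, numbers);
    }
}

// Hypothetical helper for illustration.
final class DrawDeduplicator {

    private DrawDeduplicator() {
    }

    // Keeps the first occurrence of each draw, in encounter order.
    static List<Draw> dedupe(List<Draw> draws) {
        return new ArrayList<>(new LinkedHashSet<>(draws));
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        List<Draw> merged = dedupe(Arrays.asList(
                new Draw("dlt", "21035", numbers, "TC_ORG"),
                new Draw("dlt", "21035", numbers, "WUBAI_COM")));
        System.out.println(merged.size()); // prints 1
    }
}
```

This only removes duplicates within one batch; duplicates across scheduled runs would still need a unique key in the database.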
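To pick up the latest data promptly, we set up a scheduled task that crawls at a fixed interval.
Scheduled tasks are easy to implement in Spring Boot (search Baidu for "springboot定时任务", that is, Spring Boot scheduled tasks).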
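import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3 and later

/**
 * Scheduled task that crawls the draw results
 */
@Component
public class SchedulerTask {
    private final Logger logger = LoggerFactory.getLogger(SchedulerTask.class);

    // Inject the custom Pipeline and pass it to WebMagic's Spider
    @Resource
    private DrawResultPipeline drawResultPipeline;

    /**
     * Crawls the draw results every 2 minutes between 08:00 and 23:58.
     */
    @Scheduled(cron = "0 0/2 8-23 * * ?")
    public void fetch() {
        Spider.create(new TcOrgProcessor())
                // Without start URLs the spider has nothing to crawl
                .addUrl(TcOrgProcessor.DLT_URL, TcOrgProcessor.QXC_URL, TcOrgProcessor.PL5_URL)
                .setExitWhenComplete(true)
                .addPipeline(drawResultPipeline)
                .start();
        //TODO: add spiders for the other sources
    }
}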
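To make the schedule concrete: a Spring cron expression has six fields (second, minute, hour, day of month, month, day of week), so "0 0/2 8-23 * * ?" fires at second 0 of every second minute during hours 8 through 23. The sketch below restates that rule as a plain JDK check; FetchWindow is a hypothetical helper for illustration and is not how Spring evaluates cron expressions. Note also that Spring only runs @Scheduled methods when scheduling is enabled, typically with @EnableScheduling on a configuration class.

```java
import java.time.LocalTime;

// Hypothetical helper for illustration; Spring has its own cron parser.
final class FetchWindow {

    private FetchWindow() {
    }

    // Mirrors "0 0/2 8-23 * * ?": second 0, minutes 0, 2, 4, ... 58, hours 8 through 23.
    static boolean matches(LocalTime time) {
        return time.getSecond() == 0
                && time.getMinute() % 2 == 0
                && time.getHour() >= 8
                && time.getHour() <= 23;
    }

    public static void main(String[] args) {
        System.out.println(matches(LocalTime.of(8, 0, 0)));   // prints true
        System.out.println(matches(LocalTime.of(23, 58, 0))); // prints true
        System.out.println(matches(LocalTime.of(7, 58, 0)));  // prints false
    }
}
```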
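At this point the crawl-and-persist flow is complete.
The other sources differ only in their PageProcessor; the persistence step is the same. So we only need to write the corresponding PageProcessor and add it to the scheduled task to have it crawled on schedule.
The PageProcessor for 500.com: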
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Here the data is extracted from the page itself, using XPath or regular expressions.
 * Use Chrome's F12 tools to inspect the page source and extract the data step by step.
 * http://kaijiang.500.com/
 */
public class WubaiProcessor implements PageProcessor {
    private final Logger logger = LoggerFactory.getLogger(WubaiProcessor.class);
    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    public static final String START_URL = "http://kaijiang.500.com";
    // Table row id on the page, paired with the game name stored in DrawInfo
    private static final String[][] GAMES = {
            { "dlt", "大乐透" }, { "qxc", "7星彩" }, { "plw", "排列5" }, { "pls", "排列3" }
    };
    private final Site site = Site.me();

    @Override
    public void process(Page page) {
        // Root node of the draw-results table
        Selectable rootNode = page.getHtml().xpath("//table[@class='kj_tablelist01']/tbody");
        List<DrawInfo> drawInfos = new ArrayList<>();
        // Parse each game on its own so one broken row does not affect the others
        for (String[] game : GAMES) {
            try {
                drawInfos.add(parseRow(rootNode, game[0], game[1]));
            } catch (Exception e) {
                logger.error("{}: failed to parse page: {}", game[1], e.getMessage());
            }
        }
        page.putField("results", drawInfos);
    }

    /**
     * Extracts one game's row: td[2] is the draw number, td[3] the draw date,
     * td[4] and td[5] hold scripts whose arguments carry the result and the prize pool.
     */
    private DrawInfo parseRow(Selectable rootNode, String rowId, String gameName) {
        Selectable row = rootNode.xpath("//tr[@id='" + rowId + "']");
        String drawNum = row.xpath("//td[2]/text()").replace("期", "").toString().trim();
        String strDrawDate = row.xpath("//td[3]/text()").toString().trim();
        LocalDate drawDate = LocalDate.parse(strDrawDate, DATE_FORMAT);
        String strDrawResult = row.xpath("//td[4]/script")
                .regex("formatResult\\('" + rowId + "','(.*)'\\)", 1).toString().trim();
        // Originally done only for 大乐透, whose two number groups are separated by '|';
        // for the other games this is expected to be a no-op
        strDrawResult = strDrawResult.replace("|", ",");
        String poolBalance = row.xpath("//td[5]/script")
                .regex("formatCCMoney\\('" + rowId + "','(.*)'\\)", 1).toString().trim();
        logger.info("{}: {}, {}, {}, {}", gameName, drawNum, drawDate, strDrawResult, poolBalance);
        // Build the draw object
        List<Integer> drawResult = Arrays.stream(strDrawResult.split(","))
                .map(Integer::valueOf)
                .collect(Collectors.toList());
        int poolIntValue = new BigDecimal(poolBalance).intValue();
        return new DrawInfo(gameName, drawNum, drawDate, drawResult, poolIntValue, Source.WUBAI_COM);
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new WubaiProcessor()).addUrl(START_URL).run();
    }
}
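The regex step is the least obvious part of this processor: the numbers are not in the table cell as text but inside a script call such as formatResult('dlt','...'), and the capture group pulls out the second argument. Below is a minimal JDK-only sketch of the same idea; ScriptArgExtractor is a hypothetical helper and the sample script text is made up for illustration. It uses [^']* instead of the greedy (.*) so the match cannot run past the closing quote, which is a swapped-in variation and not what the processor above does.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the processor above uses WebMagic's Selectable.regex.
final class ScriptArgExtractor {

    private ScriptArgExtractor() {
    }

    // Returns the second quoted argument of a call like formatResult('dlt','...'), if present.
    static Optional<String> secondArg(String script, String function, String gameId) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(function) + "\\('" + Pattern.quote(gameId) + "','([^']*)'\\)");
        Matcher matcher = pattern.matcher(script);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample of what a script cell might contain
        String script = "document.write(formatResult('dlt','01,02,03,04,05|06,07'));";
        System.out.println(secondArg(script, "formatResult", "dlt").orElse("not found"));
        // prints 01,02,03,04,05|06,07
    }
}
```

Returning an Optional makes the "row not found" case explicit instead of surfacing later as a NullPointerException.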
That completes the example of using WebMagic to fetch the data we want and persist it to a database.
Questions and discussion are welcome.