A Quick Start with WebMagic: Crawling Sports Lottery Draw Results

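There are many Java crawler frameworks, and WebMagic is one of them. Its documentation is thorough and it is easy to pick up, which makes it a good fit for crawling small data sets for personal use. This post covers the basic usage, with lottery draw results as the example.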

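The official WebMagic documentation is at http://webmagic.io/docs/zh/. It is detailed and walks through a complete crawl with a worked example, including persisting the crawled results.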

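WebMagic is well encapsulated. In most cases we only need to define our own PageProcessor (to extract the data) and Pipeline (to handle the extracted data, for example by persisting it).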

Following the same pattern, let's crawl the lottery draw results. The content below is for personal study only.

Requirement: crawl the lottery draw results and write them to a database.

We build on the Spring Boot framework, which makes scheduled crawling easy, and combine it with MyBatis to write the data to the database.

------------------------------------------------------------

Sources to crawl: the official Sports Lottery site (https://www.lottery.gov.cn/), 500彩票 (https://kaijiang.500.com/), and 新浪爱彩 (https://kaijiang.aicai.com/).

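The Sports Lottery games are 大乐透 (dlt), 7星彩 (qxc), 排列3 (pls), and 排列5 (plw). The abbreviations in parentheses are the keys used in the code below.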

Open the official Sports Lottery site. The home page shows the most recent draw result for each game.

 

 

Clicking any of the games opens its detail page. Why go into the detail pages? For a beginner, handling each game separately is simpler and clearer.

Open the 大乐透 detail page (https://www.lottery.gov.cn/dlt/index.html), press F12 in Chrome to open the developer tools, refresh the page, and watch the requests it makes.

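Going through the requests one by one, we find this one (the webapi.sporttery.cn getDigitalDrawInfoV1.qry request whose URL appears as DLT_URL in the code below):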

 

 

This request returns JSON, and the data matches our needs exactly, so we can use it directly.

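You may wonder why I looked at the requests first instead of analyzing the page content. In fact, before finding this request I did analyze the page. The page source contains no draw data, so I concluded that the data is loaded into the page after the initial response, and that naturally led me to inspect the requests.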
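The three endpoint URLs used in TcOrgProcessor later in this post differ only in their `param` value. As a small illustration of calling the JSON endpoint directly, the sketch below builds those URLs from the `param` string. The class and method names (`DrawUrlSketch`, `buildUrl`) are hypothetical, and the meaning of the individual `param` codes is not documented here, so they are passed through unchanged.

```java
// Hypothetical helper, not part of the original code in this post.
class DrawUrlSketch {
    // Endpoint copied from the URL constants used in this post.
    static final String ENDPOINT =
            "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry";

    // Builds the request URL for a raw "param" value such as "85,0".
    static String buildUrl(String param) {
        return ENDPOINT + "?param=" + param + "&isVerify=1";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("85,0"));          // same value as DLT_URL
        System.out.println(buildUrl("04,0"));          // same value as QXC_URL
        System.out.println(buildUrl("35,0;350133,0")); // same value as PL5_URL
    }
}
```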

JSON is the best case: we deserialize it and use it directly, with no need to extract data from HTML. The core code is as follows:

Define the TcOrgProcessor class, which describes how to extract the data we need:

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// JSON/JSONObject are assumed to be fastjson, whose API matches the calls below
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class TcOrgProcessor implements PageProcessor {
    private final Logger logger = LoggerFactory.getLogger(TcOrgProcessor.class);
    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static final String DLT_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=85,0&isVerify=1";
    public static final String QXC_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=04,0&isVerify=1";
    public static final String PL5_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=35,0;350133,0&isVerify=1";

    private final Site site = Site.me();

    @Override
    public void process(Page page) {
        String url = page.getUrl().toString();
        // Pre-check the response: a non-zero errorCode means the request failed
        String text = page.getRawText();
        JSONObject rootInfo = JSON.parseObject(text);
        if (rootInfo.getIntValue("errorCode") != 0) {
            page.setSkip(true);
            logger.error("Request returned an error, URL=>{}, body=>{}", url, text);
            return;
        }
        // Read the "value" field
        JSONObject valueObject = rootInfo.getJSONObject("value");
        // 大乐透, 7星彩 and 排列 share the same draw-result model; only the key differs
        try {
            String[] keys = new String[] { "dlt", "qxc", "plw", "pls" };
            List<DrawInfo> drawInfos = new ArrayList<>();
            for (String key : keys) {
                JSONObject drawInfoObject = valueObject.getJSONObject(key);
                if (drawInfoObject == null || drawInfoObject.isEmpty())
                    continue;
                // Read the raw fields
                String gameName = drawInfoObject.getString("lotteryGameName");
                String drawNum = drawInfoObject.getString("lotteryDrawNum");
                String strDrawResult = drawInfoObject.getString("lotteryDrawResult").replaceAll(" ", ",");
                LocalDate drawDate = LocalDate.parse(drawInfoObject.getString("lotteryDrawTime"), DATE_FORMAT);
                String poolBalance = drawInfoObject.getString("poolBalanceAfterdraw").replaceAll(",", "");
                // Build the draw-result model
                List<Integer> drawResult = Arrays.stream(strDrawResult.split(",")).map(Integer::valueOf).collect(Collectors.toList());
                int poolIntValue = new BigDecimal(poolBalance).intValue();
                DrawInfo drawInfo = new DrawInfo(gameName, drawNum, drawDate, drawResult, poolIntValue, Source.TC_ORG);
                drawInfos.add(drawInfo);
            }
            // Hand the results over to the pipelines
            page.putField("results", drawInfos);
        } catch (Exception e) {
            // Pass the exception itself so the stack trace is logged
            logger.error("Failed to parse response, URL=>{}", url, e);
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new TcOrgProcessor()).addUrl(DLT_URL).addUrl(QXC_URL).addUrl(PL5_URL).addPipeline(new ConsolePipeline()).run();
    }
}
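The field normalization inside process() can be tried in isolation with the JDK alone. The sketch below repeats the same three conversions (space-separated numbers to a list, a comma-grouped amount to an int, a timestamp to a date). The class name `DrawFieldSketch` and the sample values are made up for illustration; the timestamp format is assumed from the `yyyy-MM-dd HH:mm:ss` pattern used above, not taken from a real response.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-alone version of the conversions done in process().
class DrawFieldSketch {
    private static final DateTimeFormatter DATE_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // "01 02 03" -> [1, 2, 3]
    static List<Integer> parseNumbers(String raw) {
        return Arrays.stream(raw.trim().split("\\s+"))
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    // "188,827,520.00" -> 188827520 (the fractional part is dropped)
    static int parsePool(String raw) {
        return new BigDecimal(raw.replace(",", "")).intValue();
    }

    // LocalDate.parse keeps only the date part of a full timestamp
    static LocalDate parseDate(String raw) {
        return LocalDate.parse(raw, DATE_FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parseNumbers("01 02 03 04 05"));
        System.out.println(parsePool("188,827,520.00"));
        System.out.println(parseDate("2021-03-15 20:30:00"));
    }
}
```

Note that BigDecimal.intValue() silently wraps around for amounts above Integer.MAX_VALUE, so a long may be a safer type if the pool can exceed roughly 2.1 billion.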

The DrawInfo class is the draw-result model. We normalize the extracted data into this model so that it is easy to use later in the Pipeline.

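import java.time.LocalDate;
import java.util.List;
import java.util.Objects;

/**
 * Draw-result model
 */
public class DrawInfo {
    // Game name [大乐透, 7星彩, 排列5]
    private String game;
    // Draw number [21035]
    private String expect;
    // Draw date [2021-03-15]
    private LocalDate drawDate;
    // Draw result [1,2,3,4,5]
    private List<Integer> drawResult;
    // Prize pool [188827520]
    private int poolBalance;
    // Source the record was collected from
    private Source source;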

    public String getGame() {
        return game;
    }

    public void setGame(String game) {
        this.game = game;
    }

    public String getExpect() {
        return expect;
    }

    public void setExpect(String expect) {
        this.expect = expect;
    }

    public LocalDate getDrawDate() {
        return drawDate;
    }

    public void setDrawDate(LocalDate drawDate) {
        this.drawDate = drawDate;
    }

    public List<Integer> getDrawResult() {
        return drawResult;
    }

    public void setDrawResult(List<Integer> drawResult) {
        this.drawResult = drawResult;
    }

    public int getPoolBalance() {
        return poolBalance;
    }

    public void setPoolBalance(int poolBalance) {
        this.poolBalance = poolBalance;
    }

    public Source getSource() {
        return source;
    }

    public void setSource(Source source) {
        this.source = source;
    }

    public DrawInfo(String game, String expect, LocalDate drawDate, List<Integer> drawResult, int poolBalance, Source source) {
        this.game = game;
        this.expect = expect;
        this.drawDate = drawDate;
        this.drawResult = drawResult;
        this.poolBalance = poolBalance;
        this.source = source;
    }

    @Override
    public String toString() {
        return "DrawInfo{" + "game='" + game + '\'' + ", expect='" + expect + '\'' + ", drawDate=" + drawDate + ", drawResult=" + drawResult
                + ", poolBalance=" + poolBalance + ", source=" + source + '}';
    }

    @Override
    public boolean equals(Object o) {
        if (this == o)
            return true;
        if (o == null || getClass() != o.getClass())
            return false;
        DrawInfo drawInfo = (DrawInfo) o;
        return Objects.equals(game, drawInfo.game) && Objects.equals(expect, drawInfo.expect) && Objects.equals(drawDate, drawInfo.drawDate)
                && Objects.equals(drawResult, drawInfo.drawResult);
    }

    @Override
    public int hashCode() {
        return Objects.hash(game, expect, drawDate, drawResult);
    }
}
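The Source type used by DrawInfo is never shown in this post; only the constants Source.TC_ORG and Source.WUBAI_COM appear in the code. A minimal sketch of what such an enum could look like follows. The `homepage` field and its getter are assumptions added for illustration, and the real definition may differ.

```java
// Hypothetical definition: the post only references Source.TC_ORG and Source.WUBAI_COM.
enum Source {
    TC_ORG("https://www.lottery.gov.cn/"),
    WUBAI_COM("https://kaijiang.500.com/");

    // Assumed extra field: the site each source stands for.
    private final String homepage;

    Source(String homepage) {
        this.homepage = homepage;
    }

    String getHomepage() {
        return homepage;
    }

    public static void main(String[] args) {
        for (Source s : values()) {
            System.out.println(s + " -> " + s.getHomepage());
        }
    }
}
```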

Now that we have crawled the data we need, a custom Pipeline lets us process the results ourselves.

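import java.util.List;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class DrawResultPipeline implements Pipeline {
    private final Logger logger = LoggerFactory.getLogger(DrawResultPipeline.class);

    /**
     * Handles the data extracted by the PageProcessor
     * @param resultItems fields stored with page.putField
     * @param task the running spider
     */
    @Override
    public synchronized void process(ResultItems resultItems, Task task) {
        Map<String, Object> map = resultItems.getAll();
        logger.info("Crawl result: {}", map);
        //noinspection unchecked
        List<DrawInfo> results = (List<DrawInfo>) map.get("results");
        if (results == null || results.isEmpty()) {
            return;
        }
        //TODO: persist to the database
    }
}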
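Because the crawl is meant to run repeatedly on a schedule, the same draw will reach the Pipeline many times, so the persistence step left as a TODO should be idempotent. The sketch below shows that idea with an in-memory map standing in for the database table; `InMemoryDrawStore` and `saveIfAbsent` are hypothetical names, and a real implementation would more likely rely on a unique key on (game, draw number) in the database.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the database: one row per (game, draw number).
class InMemoryDrawStore {
    private final Map<String, String> rows = new LinkedHashMap<>();

    // Stores the draw only if it is not present yet; returns true when a new row was added.
    synchronized boolean saveIfAbsent(String game, String expect, String numbers) {
        String key = game + "#" + expect;
        return rows.putIfAbsent(key, numbers) == null;
    }

    synchronized int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        InMemoryDrawStore store = new InMemoryDrawStore();
        System.out.println(store.saveIfAbsent("dlt", "21035", "1,2,3,4,5,6,7")); // first crawl
        System.out.println(store.saveIfAbsent("dlt", "21035", "1,2,3,4,5,6,7")); // repeated crawl
        System.out.println(store.size());
    }
}
```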

To pick up the latest data promptly, we set up a scheduled task that crawls at a fixed interval.

Scheduled tasks are easy to implement in Spring Boot (search for "Spring Boot scheduled tasks").

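import javax.annotation.Resource;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

/**
 * Scheduled task that crawls the draw results.
 * @Scheduled only takes effect when @EnableScheduling is present on a configuration class.
 */
@Component
public class SchedulerTask {
    private final Logger logger = LoggerFactory.getLogger(SchedulerTask.class);
    // Inject the custom Pipeline and pass it to WebMagic's Spider
    @Resource private DrawResultPipeline drawResultPipeline;

    /**
     * Crawl the draw results on a schedule
     */
    @Scheduled(cron = "0 0/2 8-23 * * ?")
    public void fetch() {
        // The start URLs must be added here as well, otherwise the spider has nothing to crawl
        Spider.create(new TcOrgProcessor())
                .addUrl(TcOrgProcessor.DLT_URL, TcOrgProcessor.QXC_URL, TcOrgProcessor.PL5_URL)
                .setExitWhenComplete(true)
                .addPipeline(drawResultPipeline)
                .start();
        //TODO: add the spiders for the other sources
    }
}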
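The cron expression "0 0/2 8-23 * * ?" uses Spring's six-field format (second, minute, hour, day of month, month, day of week): second 0, every second minute starting at minute 0, during hours 8 through 23, every day. The JDK-only sketch below restates that rule as a plain predicate to make the firing times concrete. `CronWindowSketch` is a hypothetical name, and this is not how Spring evaluates cron expressions internally.

```java
import java.time.LocalTime;

// Hypothetical restatement of "0 0/2 8-23 * * ?" as a predicate on the time of day.
class CronWindowSketch {
    static boolean fires(LocalTime t) {
        boolean secondMatches = t.getSecond() == 0;                  // "0"
        boolean minuteMatches = t.getMinute() % 2 == 0;              // "0/2"
        boolean hourMatches = t.getHour() >= 8 && t.getHour() <= 23; // "8-23"
        return secondMatches && minuteMatches && hourMatches;
    }

    public static void main(String[] args) {
        System.out.println(fires(LocalTime.of(8, 0)));   // true: first run of the day
        System.out.println(fires(LocalTime.of(8, 1)));   // false: odd minute
        System.out.println(fires(LocalTime.of(7, 58)));  // false: before 08:00
        System.out.println(fires(LocalTime.of(23, 58))); // true: last run of the day
    }
}
```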

That completes the crawl-and-persist flow.

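The other sources differ only in their PageProcessor; the persistence step is the same. So we only need to write the corresponding PageProcessor and add it to the scheduled task, and it will be crawled on schedule.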

The PageProcessor for 500.com:

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

/**
 * Here the data is extracted from the page itself, so XPath or regular expressions are needed to pull out the values.
 * Use Chrome's F12 tools to inspect the page source and extract the data step by step.
 * http://kaijiang.500.com/
 */
public class WubaiProcessor implements PageProcessor {
    private final Logger logger = LoggerFactory.getLogger(WubaiProcessor.class);
    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    public final static String START_URL = "http://kaijiang.500.com";
    private final Site site = Site.me();

    @Override
    public void process(Page page) {
        // Root node of the draw table
        Selectable rootNode = page.getHtml().xpath("//table[@class=kj_tablelist01]/tbody");
        List<DrawInfo> drawInfos = new ArrayList<>();
        // The four rows share one layout and differ only in the row id and game name
        addDrawInfo(drawInfos, rootNode, "dlt", "大乐透");
        addDrawInfo(drawInfos, rootNode, "qxc", "7星彩");
        addDrawInfo(drawInfos, rootNode, "plw", "排列5");
        addDrawInfo(drawInfos, rootNode, "pls", "排列3");
        page.putField("results", drawInfos);
    }

    // Parses the table row with the given id; a failure is logged and only that game is skipped
    private void addDrawInfo(List<DrawInfo> drawInfos, Selectable rootNode, String id, String gameName) {
        try {
            Selectable row = rootNode.xpath("//tr[@id=" + id + "]");
            String drawNum = row.xpath("//td[2]/text()").replace("期", "").toString().trim();
            String strDrawDate = row.xpath("//td[3]/text()").toString().trim();
            LocalDate drawDate = LocalDate.parse(strDrawDate, DATE_FORMAT);
            String strDrawResult = row.xpath("//td[4]/script").regex("formatResult\\('" + id + "','(.*)'\\)", 1).toString().trim();
            // The original code does this for 大乐透 only, whose result also uses "|" as a separator;
            // it is a no-op when "|" is absent
            strDrawResult = strDrawResult.replace("|", ",");
            String poolBalance = row.xpath("//td[5]/script").regex("formatCCMoney\\('" + id + "','(.*)'\\)", 1).toString().trim();
            logger.info("{}: {}, {}, {}, {}", gameName, drawNum, drawDate, strDrawResult, poolBalance);
            // Build the draw-result object
            List<Integer> drawResult = Arrays.stream(strDrawResult.split(",")).map(Integer::valueOf).collect(Collectors.toList());
            int poolIntValue = new BigDecimal(poolBalance).intValue();
            drawInfos.add(new DrawInfo(gameName, drawNum, drawDate, drawResult, poolIntValue, Source.WUBAI_COM));
        } catch (Exception e) {
            logger.error("{}: failed to parse page: {}", gameName, e.getMessage());
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new WubaiProcessor()).addUrl(START_URL).run();
    }
}
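The regular expressions in WubaiProcessor pull the numbers out of an inline script call such as formatResult('dlt','...'). The same extraction can be tried with java.util.regex alone, as sketched below. The class name `ScriptRegexSketch` and the sample script text are made up for illustration and are not copied from the real page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical stand-alone version of the formatResult(...) extraction.
class ScriptRegexSketch {
    static List<Integer> extractNumbers(String script, String id) {
        // Same pattern as in WubaiProcessor: capture the second argument of formatResult
        Pattern pattern = Pattern.compile("formatResult\\('" + Pattern.quote(id) + "','(.*)'\\)");
        Matcher matcher = pattern.matcher(script);
        if (!matcher.find()) {
            throw new IllegalArgumentException("no formatResult call found for " + id);
        }
        // Treat "|" as just another separator, then parse every number
        String raw = matcher.group(1).replace("|", ",");
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample input
        String script = "<script>formatResult('dlt','05,12,19,27,33|02,09');</script>";
        System.out.println(extractNumbers(script, "dlt"));
    }
}
```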

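This concludes the example: we used WebMagic to obtain the data we wanted and set up the Pipeline through which it is persisted to the database.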

 

 

Questions and discussion are welcome.

 

posted @ 2021-04-06 21:34  依天照海