Java crawler notes: incremental crawling of www.baiduyunsousou.com with WebCollector

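WebCollector can be configured for resumable crawling, so a run picks up where the previous one stopped. Previously crawled data is deduplicated by key, and the key is the URL.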
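The URL-as-key deduplication described above can be illustrated without WebCollector itself. The following is a minimal sketch, not WebCollector's actual implementation: the class `UrlDedupSketch` and its method `filterNew` are hypothetical names, and an in-memory `HashSet` stands in for the persistent history that a resumable crawl would keep on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of URL-keyed deduplication (not WebCollector code).
public class UrlDedupSketch {

    // URLs seen in earlier rounds; the URL string itself is the dedup key.
    private final Set<String> history = new HashSet<>();

    /** Returns only the URLs not seen before and records them as seen. */
    public List<String> filterNew(List<String> candidates) {
        List<String> fresh = new ArrayList<>();
        for (String url : candidates) {
            // Set.add returns false when the key is already present.
            if (history.add(url)) {
                fresh.add(url);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        UrlDedupSketch dedup = new UrlDedupSketch();
        // First round: both URLs are new.
        System.out.println(dedup.filterNew(
                Arrays.asList("https://example.com/a", "https://example.com/b")));
        // Second round: /b was already crawled, so only /c is returned.
        System.out.println(dedup.filterNew(
                Arrays.asList("https://example.com/b", "https://example.com/c")));
    }
}
```

In an incremental crawl, only the URLs returned by `filterNew` would be fetched, which is why a second run over the same site does far less work than the first.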

 

I have recently been crawling Baidu Cloud netdisk resources; here are my notes.

 

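import java.util.ArrayList;
import java.util.List;

import cn.edu.hfut.dmic.webcollector.plugin.nextfilter.HashSetNextFilter;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

// Imports for the project's own classes (BaseCrawler, CrawlerConfig,
// DeepCrawlerThread, SimpleCrawlerStoreMap) are omitted; their packages are not shown here.

/**
 * @author Liu
 * @create 2022-08-02 11:48
 */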
@Component
@Slf4j
public class DeepCrawler extends BaseCrawler {
 
    private CrawlerConfig crawlerConfig;
 
    @Override
    public void execute() {
        List<CrawlerConfig> crawlerConfigs = new ArrayList<>();
        if (this.crawlerConfig != null) {
            crawlerConfigs.add(this.crawlerConfig);
        } else {
            crawlerConfigs = this.crawlerConfigService.getDeepCrawlerConfig();
        }
 
        super.initCrawlerConfig(crawlerConfigs);
 
        // Crawl multiple sites concurrently, one pool task per site
        for (CrawlerConfig config : crawlerConfigs) {
            try {
                if (SimpleCrawlerStoreMap.deepCrawlerThreadMap.get(config.getId()) == null) {
                    simpleCrawlerPool.execute(() -> {
                        DeepCrawlerThread deepCrawlerThread = new DeepCrawlerThread(config);
                        SimpleCrawlerStoreMap.deepCrawlerThreadMap.put(config.getId(), deepCrawlerThread);
                        deepCrawlerThread.setNextFilter(new HashSetNextFilter());
                        try {
                            deepCrawlerThread.start(config.getDeep());
                        } catch (Exception e) {
                            // Log once, with the stack trace, instead of also printing to stderr
                            log.error(config.getSiteName() + " => crawl task failed", e);
                        }
                    });
 
                } else {
                    log.info(config.getSiteName() + " => crawl task already in progress...");
                }
            } catch (Exception e) {
                log.error(config.getSiteName() + " => failed to submit crawl task", e);
            }
        }
    }
 
 
    public CrawlerConfig getCrawlerConfig() {
        return crawlerConfig;
    }
 
    public void setCrawlerConfig(CrawlerConfig crawlerConfig) {
        this.crawlerConfig = crawlerConfig;
    }
 
 
}
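
In `execute()`, the "already running" check happens before the task is submitted, while the entry is only put into the map inside the pool thread, so two overlapping calls may both pass the check for the same site. As a sketch only, assuming the map can be a `ConcurrentHashMap` and that the site id is a `Long` (neither the type of `SimpleCrawlerStoreMap.deepCrawlerThreadMap` nor that of `config.getId()` is shown above), the check and the registration can be merged into a single atomic `putIfAbsent` call. `SiteTaskRegistry`, `tryStart` and `finish` are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry that allows at most one running crawl per site id.
public class SiteTaskRegistry {

    private final ConcurrentMap<Long, Runnable> running = new ConcurrentHashMap<>();

    /** Atomically registers the task; returns false if the site already has one. */
    public boolean tryStart(Long siteId, Runnable task) {
        // putIfAbsent returns null only when no entry existed for this key.
        return running.putIfAbsent(siteId, task) == null;
    }

    /** Removes the entry so the site can be crawled again later. */
    public void finish(Long siteId) {
        running.remove(siteId);
    }

    public static void main(String[] args) {
        SiteTaskRegistry registry = new SiteTaskRegistry();
        Runnable crawl = () -> { /* crawl work would go here */ };
        System.out.println(registry.tryStart(1L, crawl)); // first registration succeeds
        System.out.println(registry.tryStart(1L, crawl)); // duplicate is rejected
        registry.finish(1L);
        System.out.println(registry.tryStart(1L, crawl)); // allowed again after finish
    }
}
```

With this shape, the caller would invoke `tryStart` before submitting to the pool and call `finish` in a `finally` block once the crawl ends, so a failed run does not block the site forever.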

  

posted @ java小奔奔