Andy 胡

导航

网络爬虫2:使用crawler4j爬取网络内容

https://github.com/yasserg/crawler4j

需要两个包:

  crawler4j-4.1-jar-with-dependencies.jar

  slf4j-simple-1.7.22.jar(如果不加,会有警告:SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".)

 相关包下载:

http://download.csdn.net/detail/talkwah/9747407

 

(crawler4j-4.1-jar-with-dependencies.jar相关资料少,github下载半天还失败,故整理了一下)

参考资料:

http://blog.csdn.net/zjm131421/article/details/13093869

 

http://favccxx.blog.51cto.com/2890523/1691079/

 

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class AhCrawler extends WebCrawler {
    // 三要素:
    // _访问谁?
    // _怎么访?
    // _访上了怎么处置?
    private static final String C_URL = "http://www.ximalaya.com";

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // 不匹配:MP3|jpg|png结尾的资源
        Pattern p = Pattern.compile(".*(\\.(MP3|jpg|png))$");
        return !p.matcher(href).matches() && href.startsWith(C_URL);
    }

    @Override
    public void visit(Page page) {

        String url = page.getWebURL().getURL();
        String parentUrl = page.getWebURL().getParentUrl();
        String anchor = page.getWebURL().getAnchor();
        System.out.println("↓↓↓↓↓↓↓↓↓");
        System.out.println("URL        :" + url);
        System.out.println("Parent page:" + parentUrl);
        System.out.println("Anchor text:" + anchor);

        logger.info("URL: {}", url);
        logger.debug("Parent page: {}", parentUrl);
        logger.debug("Anchor text: {}", anchor);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();
            System.out.println("--------------------------");
            // System.out.println("Text length: " + text.length());
            // System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
        System.out.println("↑↑↑↑↑↑↑");
    }

    public static void main(String[] args) throws Exception {
        // 源代码例子中,这两位是两只参数
        // 配置个路径,这个路径相当于Temp文件夹,不用先建好,
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig crawlConf = new CrawlConfig();
        crawlConf.setCrawlStorageFolder(crawlStorageFolder);
        PageFetcher pageFetcher = new PageFetcher(crawlConf);

        RobotstxtConfig robotConf = new RobotstxtConfig();
        RobotstxtServer robotServ = new RobotstxtServer(robotConf, pageFetcher);

        // 控制器
        CrawlController c = new CrawlController(crawlConf, pageFetcher,
                robotServ);
        // 添加网址
        c.addSeed(C_URL);

        // 启动爬虫
        c.start(AhCrawler.class, numberOfCrawlers);
    }
}

 

CrawlController c 的来历:

结果示例:

↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/5333001/sound/25320285
Parent page:http://www.ximalaya.com/dq/music-ACG/
Anchor text:俊豪演奏 - 琵琶版《刀劍如夢》
[Crawler 3] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5333001/sound/25320285
--------------------------
Number of outgoing links: 131
↑↑↑↑↑↑↑
[Crawler 7] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/30119950/sound/12181402
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/30119950/sound/12181402
Parent page:http://www.ximalaya.com/dq/book-果麦文化/
Anchor text:第二十六集 人生的意思不在于留下什么,而在于经历
--------------------------
Number of outgoing links: 134
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/zhubo/56833971/
Parent page:http://www.ximalaya.com/4932085/sound/21902925
Anchor text:null
[Crawler 1] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/zhubo/56833971/
--------------------------
Number of outgoing links: 68
↑↑↑↑↑↑↑
[Crawler 4] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5413571/sound/2349697
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/5413571/sound/2349697
Parent page:http://www.ximalaya.com/dq/renwen-新知/
Anchor text:41-方明-西江月·夜行黄沙道中 南宋 辛弃疾
--------------------------
Number of outgoing links: 134
↑↑↑↑↑↑↑
[Crawler 6] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5011186/sound/30650945
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/5011186/sound/30650945
Parent page:http://www.ximalaya.com/dq/finance-大咖/
Anchor text:03
--------------------------
Number of outgoing links: 111
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/1000144/album/3559805
Parent page:http://www.ximalaya.com/dq/music-文艺/
Anchor text:null
[Crawler 2] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/1000144/album/3559805
--------------------------
Number of outgoing links: 85
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/4932085/sound/21902925/liker
Parent page:http://www.ximalaya.com/4932085/sound/21902925
Anchor text:更多
[Crawler 1] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/4932085/sound/21902925/liker
--------------------------
Number of outgoing links: 96
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/30895669/sound/19945445
Parent page:http://www.ximalaya.com/dq/music-ACG/
Anchor text:宫崎骏-久石让
[Crawler 3] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/30895669/sound/19945445
--------------------------
Number of outgoing links: 131
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/9112346/album/2903291
Parent page:http://www.ximalaya.com/dq/book-果麦文化/
Anchor text:null
[Crawler 7] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/9112346/album/2903291
--------------------------
Number of outgoing links: 90
↑↑↑↑↑↑↑

 

 


posted on 2017-02-05 14:46  talkwah  阅读(485)  评论(0编辑  收藏  举报