Continuing Beijing Experiment 1

 If you forget where the JDK is installed, the shortcut you create cannot find the JDK, and the launch fails.

 I had installed it myself and then promptly forgot the location.

After I corrected the path, launching through the shortcut succeeded.
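One way to recover a forgotten JDK location is to ask a running JVM where it lives. The sketch below is only an illustration: the class name `JdkLocation` and its helper methods are hypothetical and not part of the experiment. It prints the standard `java.home` system property and the `JAVA_HOME` environment variable, which may be unset. On JDK 8, `java.home` usually points at the `jre` subdirectory inside the JDK, so the JDK root would be its parent directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original experiment.
public class JdkLocation {

    // Directory of the Java runtime that is executing this program.
    static Path javaHome() {
        return Paths.get(System.getProperty("java.home"));
    }

    // True if the directory contains a runnable bin/java (or bin/java.exe on Windows).
    static boolean hasJavaBinary(Path home) {
        Path bin = home.resolve("bin");
        return Files.isExecutable(bin.resolve("java"))
                || Files.isExecutable(bin.resolve("java.exe"));
    }

    public static void main(String[] args) {
        Path home = javaHome();
        System.out.println("java.home = " + home);
        // JAVA_HOME is optional, so this may print "null".
        System.out.println("JAVA_HOME = " + System.getenv("JAVA_HOME"));
        System.out.println("bin/java present: " + hasJavaBinary(home));
    }
}
```

The printed `java.home` value (or its parent directory on JDK 8) is the path to compare against the one configured in the shortcut.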

 

 

 The code that was run:

package my.webmagic;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.FileCacheQueueScheduler;

// Crawls the letter list pages of rexian.beijing.gov.cn and queues the letter detail links.
public class Getgov implements PageProcessor {

    // Retry a failed download up to 3 times and sleep 100 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // Store the full HTML of the current page under the key "allhtml".
        page.putField("allhtml", page.getHtml().toString());

        // Queue list pages 1 through 2701.
        for (int i = 1; i < 2702; i++) {
            String url = "http://rexian.beijing.gov.cn/default/com.web.index.moreNewLetterQuery.flow?PageCond/currentPage="
                    + i + "&type=nextPage";
            page.addTargetRequest(url);
        }

        // Queue every letter detail link found on the current page.
        // The dots are escaped so that they match a literal '.' only.
        page.addTargetRequests(page.getHtml().links()
                .regex("(com\\.web\\.\\w+\\.\\w+\\.flow\\?originalId=\\w+)").all());
    }

    public static void main(String[] args) {
        Spider.create(new Getgov())
                .addUrl("http://rexian.beijing.gov.cn/default/com.web.index.moreNewLetterQuery.flow?type=firstPage")
                // Write the extracted fields to files under this directory.
                .addPipeline(new FilePipeline("/home/hadoop/data/edu1"))
                // Keep the URL queue on disk in the same directory.
                .setScheduler(new FileCacheQueueScheduler("/home/hadoop/data/edu1"))
                .thread(5)
                .run();
    }
}
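The URL construction and the link pattern can be checked without WebMagic or network access. The following is a minimal sketch using only `java.util.regex`. The class name `CrawlUrlSketch`, its methods, and the sample HTML (including the path `com.web.consult.consultDetail.flow` and the id `AH123`) are made up for illustration and are not taken from the real site. The pattern is the crawler's pattern with the dots escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone check, independent of WebMagic.
public class CrawlUrlSketch {

    static final String LIST_URL =
            "http://rexian.beijing.gov.cn/default/com.web.index.moreNewLetterQuery.flow?PageCond/currentPage=";

    // Detail-link pattern with the dots escaped.
    static final Pattern DETAIL_LINK =
            Pattern.compile("(com\\.web\\.\\w+\\.\\w+\\.flow\\?originalId=\\w+)");

    // Builds the URL of list page number `page`, as in the crawler's loop.
    static String listPageUrl(int page) {
        return LIST_URL + page + "&type=nextPage";
    }

    // Returns every substring of `html` that matches the detail-link pattern.
    static List<String> detailLinks(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = DETAIL_LINK.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample markup: one detail link and one list link.
        String sample = "<a href=\"/default/com.web.consult.consultDetail.flow?originalId=AH123\">letter</a>"
                + "<a href=\"/default/com.web.index.moreNewLetterQuery.flow?type=firstPage\">first</a>";
        System.out.println(listPageUrl(1));
        System.out.println(detailLinks(sample));
    }
}
```

Only the first link in the sample matches, because the second one has no `originalId` parameter. Running a check like this against a saved copy of a real list page would show whether the pattern picks up the intended links before starting a long crawl.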

 

posted on 2024-01-23 23:00 by 夜的第七章i