雨渐渐

Machine learning open course (MOOC)

Summary: https://class.coursera.org/ntumlone-001/class/index

posted @ 2013-12-17 16:42 雨渐渐 Views (150) Comments (0) Recommended (0)

Summary: Reference source: http://www.cnblogs.com/morewindows/archive/2011/08/13/2137415.html
__author__ = 'root'
arr_in = [72, 6, 57, 88, 60, 42, 83, 73, 48, 85]
def sort(start, end): ... j -= 1 ...
void quick_sort(int s[], int l, int r) { ... j--; ... }
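The quick sort in this entry can be sketched in Python as follows. This is a minimal sketch, assuming the "dig a hole and fill it" partition scheme that the surviving fragments (`j -= 1`, `quick_sort(int s[], int l, int r)`) suggest; the function and variable names other than `arr_in` are illustrative, not the author's.

```python
def quick_sort(arr, left, right):
    """Sort arr[left..right] in place with a hole-filling partition."""
    if left >= right:
        return
    i, j = left, right
    pivot = arr[left]  # lift the first element out, leaving a hole at i
    while i < j:
        # Scan from the right for an element smaller than the pivot
        while i < j and arr[j] >= pivot:
            j -= 1
        if i < j:
            arr[i] = arr[j]  # fill the hole at i; the hole moves to j
            i += 1
        # Scan from the left for an element not smaller than the pivot
        while i < j and arr[i] < pivot:
            i += 1
        if i < j:
            arr[j] = arr[i]  # fill the hole at j; the hole moves to i
            j -= 1
    arr[i] = pivot  # the last hole is the pivot's final position
    quick_sort(arr, left, i - 1)
    quick_sort(arr, i + 1, right)


arr_in = [72, 6, 57, 88, 60, 42, 83, 73, 48, 85]
quick_sort(arr_in, 0, len(arr_in) - 1)
print(arr_in)
```

The sort runs in place, so the caller passes the inclusive index range rather than receiving a new list.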

posted @ 2013-12-05 16:36 雨渐渐 Views (181) Comments (0) Recommended (0)

mapreduce (3): Implementing an inverted index with MapReduce (part 2)

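Summary: Hadoop API: http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Reducer.html
A change in requirements: the "document term-frequency list" must be sorted, i.e. documents with higher occurrence counts come first.
Approach:
Code: package proj;import...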
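The sorting requirement can be illustrated outside Hadoop with a few lines of Python. This is only a sketch of the idea, not the Java reducer from the post: the function name `reduce_sorted_postings`, the `doc:count;doc:count` output format, and the tie-break by document id are all assumptions made for illustration.

```python
from collections import Counter


def reduce_sorted_postings(word, doc_ids):
    """Count occurrences of `word` per document and list the busiest documents first."""
    counts = Counter(doc_ids)
    # Sort by count descending; break ties by document id so the output is stable
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return word, ";".join(f"{doc}:{n}" for doc, n in ordered)


# Hypothetical input: the word "is" seen once in 0.txt and twice in 1.txt
result = reduce_sorted_postings("is", ["0.txt", "1.txt", "1.txt"])
print(result)
```

Sorting inside the reducer works here because all postings for one word arrive at the same reduce call.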

posted @ 2013-12-05 14:10 雨渐渐 Views (594) Comments (4) Recommended (0)

temp gbk2utf8

Summary:
__author__ = 'root'
# -*- coding: utf-8 -*-
ps = '/data/poitestdata/行政地名.csv'
pt = '/data/poitestdata/utf8_行政地名.csv'
perr = '/data/poitestdata/err_行政地名.csv'
f = open(ps)
f_ok = open(pt, 'w')
f_err = open(perr, 'w')
count_err = 0
count_ok = 0
while True:
    line = f ...
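Since the script above is cut off inside its loop, here is a minimal Python 3 sketch of the same idea: read a GBK file line by line, write the decodable lines out as UTF-8, and divert the rest to an error file. The function name `gbk_to_utf8` and the temporary demo files are hypothetical; the original works on the fixed paths under /data/poitestdata and its exact error handling is not shown.

```python
import os
import tempfile


def gbk_to_utf8(src, dst_ok, dst_err):
    """Convert a GBK file to UTF-8; undecodable lines are copied raw to dst_err."""
    count_ok = count_err = 0
    with open(src, "rb") as f, \
         open(dst_ok, "w", encoding="utf-8", newline="") as f_ok, \
         open(dst_err, "wb") as f_err:
        for raw in f:  # binary iteration splits on b"\n", which never occurs inside a GBK character
            try:
                f_ok.write(raw.decode("gbk"))
                count_ok += 1
            except UnicodeDecodeError:
                f_err.write(raw)
                count_err += 1
    return count_ok, count_err


# Demo on throwaway files: one valid GBK line and one invalid byte sequence
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    ok = os.path.join(tmp, "utf8_in.csv")
    err = os.path.join(tmp, "err_in.csv")
    with open(src, "wb") as fh:
        fh.write("行政地名\n".encode("gbk"))
        fh.write(b"\xff\n")
    counts = gbk_to_utf8(src, ok, err)
    with open(ok, encoding="utf-8") as fh:
        converted = fh.read()
print(counts)
```

Keeping the failed lines as raw bytes makes it possible to inspect them later with a different codec.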

posted @ 2013-12-03 09:55 雨渐渐 Views (176) Comments (0) Recommended (0)

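mapreduce (2): Implementing an inverted index with MapReduce (part 1). The combiner pre-aggregates map output locally on the map side before it is sent to the reducers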

Summary: 1 Approach:
0.txt MapReduce is simple
1.txt MapReduce is powerfull is simple
2.txt Hello MapReduce bye MapReduce
1 map function: context.write(word:docid, 1), i.e. word:doc...
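The data flow for the three sample files can be simulated in plain Python. This is a sketch under assumptions: only the map step (`word:docid`, 1) is given in the summary, so the combiner that sums counts and re-keys by word, the reducer that joins the postings, and the `docid:count` format are illustrative guesses at the usual design, with hypothetical function names.

```python
from collections import defaultdict

docs = {
    "0.txt": "MapReduce is simple",
    "1.txt": "MapReduce is powerfull is simple",
    "2.txt": "Hello MapReduce bye MapReduce",
}


def map_phase(doc_id, text):
    # Emit ("word:docid", 1) for every word, as described in the summary
    return [(f"{word}:{doc_id}", 1) for word in text.split()]


def combine_phase(pairs):
    # Assumed combiner: sum counts per "word:docid", then re-key as word -> "docid:count"
    sums = defaultdict(int)
    for key, n in pairs:
        sums[key] += n
    out = []
    for key, total in sums.items():
        word, doc_id = key.rsplit(":", 1)
        out.append((word, f"{doc_id}:{total}"))
    return out


def reduce_phase(pairs):
    # Assumed reducer: gather all postings of a word into one line
    index = defaultdict(list)
    for word, posting in pairs:
        index[word].append(posting)
    return {word: ";".join(sorted(postings)) for word, postings in index.items()}


combined = []
for doc_id, text in docs.items():
    # One map task per file; its output is combined before the "shuffle"
    combined.extend(combine_phase(map_phase(doc_id, text)))
index = reduce_phase(combined)
for word in sorted(index):
    print(word, index[word])
```

Re-keying in the combiner is what lets all postings for one word meet in the same reduce call.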

posted @ 2013-11-27 00:41 雨渐渐 Views (466) Comments (0) Recommended (0)

nutch fetcher.server.delay

Summary: 1 Configuration factor: fetcher.server.delay 0.0 The number of seconds the fetcher will delay between successive requests to the same server.
2 robots.txt factor:
FetchItemQueue fiq = fetchQueues.getFetchItemQueue(fit.queueID);
fiq.crawlDelay = rules.getCrawlDelay();
if (LOG.isDebugEnabled()) {...
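How the two factors interact can be illustrated with Python's standard `urllib.robotparser`. This is not Nutch code: the helper `effective_crawl_delay` is hypothetical, and the rule that a `Crawl-delay` in robots.txt overrides the configured delay is an assumption read off the `fiq.crawlDelay = rules.getCrawlDelay()` line above.

```python
from urllib.robotparser import RobotFileParser


def effective_crawl_delay(robots_lines, agent, server_delay=0.0):
    """Return the robots.txt Crawl-delay for `agent`, else the configured delay."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse robots.txt content given as a list of lines
    delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else server_delay


with_delay = ["User-agent: *", "Crawl-delay: 5", "Disallow: /private"]
without_delay = ["User-agent: *", "Disallow: /private"]

print(effective_crawl_delay(with_delay, "mybot"))
print(effective_crawl_delay(without_delay, "mybot", server_delay=1.5))
```

The second call falls back to the configured value, which plays the role of fetcher.server.delay here.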

posted @ 2013-11-25 16:34 雨渐渐 Views (221) Comments (0) Recommended (0)

nutch Fetcher phase explained

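Summary: job.setSpeculativeExecution(false); During the fetch phase the same task must not run more than once, otherwise pages would be fetched twice. Speculative execution exists to make full use of idle resources and speed up map and reduce execution: several attempts of the same map or reduce task run at the same time, the first one to finish wins, and the others are killed.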

posted @ 2013-11-25 11:42 雨渐渐 Views (235) Comments (0) Recommended (0)

Hadoop study notes (2): HDFS API

Summary: 4. Delete a file on HDFS
package proj;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class DeleteFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configurati...

posted @ 2013-11-23 16:27 雨渐渐 Views (301) Comments (0) Recommended (0)

Hadoop study notes (1): HDFS API

Summary: http://www.cnblogs.com/liuling/p/2013-6-17-01.html (this one is also good)
http://www.teamwiki.cn/hadoop/thrift (Thrift programming)
1. Upload a local file to HDFS
package proj;import org.apache....

posted @ 2013-11-23 11:29 雨渐渐 Views (353) Comments (0) Recommended (0)

nutch getOutLinks: handling outlinks

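Summary: Reposted from: http://blog.csdn.net/witsmakemen/article/details/8067530
Tracing shows that the Fetcher has no problem parsing links out of the fetched pages: it obtains every link in the page and then writes them out in the output() function through the FetcherOutputFormat class (contained in ParseResu...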

posted @ 2013-11-18 15:59 雨渐渐 Views (388) Comments (0) Recommended (0)