An Introduction to Asynchronous Programming and Twisted (2)

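Summary: Part 6: And Then We Took It Higher. The client 2.0 from Part 5 is already well encapsulated: a user only needs to understand and modify PoetryProtocol and PoetryClientFactory to build an application. In fact, the protocol's logic here is simply to receive data and, once everything has arrived, notify the factory to process it. This logic can already serve as common framework code that the user does not need to change. What really needs...
posted @ 2011-09-07 10:02 fxjwind Views (390) Comments (0) Recommendations (0)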

Mining of Massive Datasets – Link Analysis

Summary: 5.1 PageRank. 5.1.1 Early Search Engines and Term Spam. As people began to use search engines to find their way around the Web, unethical people saw the opportunity to fool search engines into leading people to their page. Techniques for fooling search engines into believing your page is about something...
posted @ 2011-09-06 15:49 fxjwind Views (625) Comments (0) Recommendations (0)

Mining of Massive Datasets – Mining Data Streams

Summary: Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make another assumption: data...
posted @ 2011-08-31 14:48 fxjwind Views (583) Comments (0) Recommendations (0)

Bloom Filter Python

Summary: http://bitworking.org/news/380/bloom-filter-resources The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be adde...
posted @ 2011-08-30 10:20 fxjwind Views (811) Comments (0) Recommendations (0)
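
The membership test described in the Bloom filter entry above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the linked resources: the class name `BloomFilter`, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch; sizes and hashing scheme are illustrative."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))
```

After `bf = BloomFilter()` and `bf.add("apple")`, the test `"apple" in bf` is always true. An item that was never added usually tests false, but it can occasionally test true, which is the false-positive behaviour the summary mentions.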

Mining of Massive Datasets – Data Mining

Summary: 1 What is Data Mining? The most commonly accepted definition of “data mining” is the discovery of “models” for data. 1.1 Statistical Modeling. Statisticians were the first to use the term “data m...
posted @ 2011-08-29 15:00 fxjwind Views (565) Comments (0) Recommendations (0)

Mining of Massive Datasets – Finding similar items

Summary: In an earlier blog post (http://www.cnblogs.com/fxjwind/archive/2011/07/05/2098642.html), I wrote up the related problem of near-duplicate detection over massive document collections. Here I record, more systematically, how to find similar items with large-scale data mining techniques... 1 Applications of Near-Neighbor Search. The Jaccard similarity of sets S and T is |S ∩ T|/|S ∪ T|, that is, the ratio of the size of the intersection of S...
posted @ 2011-08-24 09:44 fxjwind Views (713) Comments (0) Recommendations (0)
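
The Jaccard similarity defined in the entry above, |S ∩ T|/|S ∪ T|, translates directly into Python sets. The function name `jaccard_similarity` and the convention for two empty sets are assumptions of this sketch, not something the summary specifies.

```python
def jaccard_similarity(s, t):
    """Return |S ∩ T| / |S ∪ T| for two iterables treated as sets."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(s & t) / len(union)
```

For example, `jaccard_similarity({1, 2, 3}, {2, 3, 4})` has an intersection of size 2 and a union of size 4, so the result is 0.5.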

Filtering microblogging messages for Social TV

Summary: Paper summary of Filtering microblogging messages for Social TV and A Bootstrapping Approach to Identifying Relevant Tweets for Social TV. Social TV was named one of the ten most important emerging technologies in 2010 by the MIT Technology Review. Social Television is a general term for technology that supports com...
posted @ 2011-08-02 17:30 fxjwind Views (335) Comments (0) Recommendations (0)

Design Pattern

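Summary: Strategy pattern. The strategy pattern defines a family of algorithms, encapsulates each one, and makes them interchangeable. It lets the algorithm vary independently of the clients that use it. Scenario: an input needs different processing logic, that is, different algorithms, under different conditions. The C approach wraps each algorithm in a function but still relies on many if... else... branches to decide which function handles each condition. The object-oriented approach, building on the factory pattern mentioned above, defines an abstract algorithm...
posted @ 2011-07-06 09:27 fxjwind Views (380) Comments (0) Recommendations (0)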
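
The strategy pattern described in the entry above can be sketched as follows. All names here (`TextStrategy`, `UpperCase`, `Reverse`, `Processor`) are hypothetical and chosen only for illustration. The point is that the client holds an interchangeable strategy object instead of branching with if... else...

```python
from abc import ABC, abstractmethod


class TextStrategy(ABC):
    """Abstract algorithm: every concrete strategy implements process()."""

    @abstractmethod
    def process(self, text):
        ...


class UpperCase(TextStrategy):
    def process(self, text):
        return text.upper()


class Reverse(TextStrategy):
    def process(self, text):
        return text[::-1]


class Processor:
    """Client that delegates to whichever strategy it currently holds."""

    def __init__(self, strategy):
        self.strategy = strategy

    def run(self, text):
        # No if/else on the algorithm type: the strategy object decides.
        return self.strategy.process(text)
```

Swapping the algorithm is then a matter of assigning a different object, for example `p = Processor(UpperCase())` followed later by `p.strategy = Reverse()`, without touching the client code.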

An Introduction to Asynchronous Programming and Twisted (1)

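Summary: When I read this series before, the line of thought never felt very clear; in fact, Dave does not explain the underlying model clearly. See "Synchronous and asynchronous, blocking and non-blocking, Reactor and Proactor". Blocking is always synchronous, but the converse does not necessarily hold. Multithreading is essentially still a blocking approach, except that several threads block together; it suits CPU-intensive tasks, because someone still has to do the work, and no model can reduce the time the work itself takes. Non-blocking saves the time spent waiting, so it suits...
posted @ 2011-07-05 21:03 fxjwind Views (564) Comments (0) Recommendations (0)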
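
The claim in the entry above that non-blocking code saves waiting time, not working time, can be illustrated with a small timing sketch. It uses the standard-library `asyncio` event loop as a stand-in for Twisted's reactor, which is an assumption of this example and not what the series itself uses. The function names and the `WAIT` value are hypothetical, and sleeping stands in for waiting on I/O.

```python
import asyncio
import time

WAIT = 0.05  # seconds each simulated I/O wait lasts (illustrative value)


def run_blocking(n):
    """Wait n times, one after another; total time is about n * WAIT."""
    start = time.perf_counter()
    for _ in range(n):
        time.sleep(WAIT)  # the caller is stuck until this wait ends
    return time.perf_counter() - start


async def _wait_all(n):
    # All n waits are pending at once; the event loop overlaps them.
    await asyncio.gather(*(asyncio.sleep(WAIT) for _ in range(n)))


def run_non_blocking(n):
    """Wait n times concurrently; total time is about WAIT."""
    start = time.perf_counter()
    asyncio.run(_wait_all(n))
    return time.perf_counter() - start
```

Calling `run_blocking(4)` and `run_non_blocking(4)` should show the second finishing in roughly a quarter of the time. If the sleeps were replaced by CPU-bound work, the overlap would disappear, which matches the summary's point that no model shortens the work itself.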

decruft (A library to extract meaningful data from a webpage): Source Code Analysis

Summary: An open-source Python module, http://code.google.com/p/decruft/. decruft usage example:
    from decruft import Document
    #import urllib2
    #f = urllib2.urlopen('url')
    f = open('index.html', 'r')  # open for reading; append mode cannot be read
    print Document(f.read()).sum...
posted @ 2011-07-05 21:01 fxjwind Views (421) Comments (0) Recommendations (0)