Summary: http://tez.incubator.apache.org/ http://dongxicheng.org/mapreduce-nextgen/apache-tez/ http://dongxicheng.org/mapreduce-nextgen/apache-tez-newest-progress/ Tez aims to be a general purpose execut...
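Summary: http://www.socc2013.org/home/program http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop-yarn/ Problems with Hadoop V1.0: Hadoop was originally invented to index massive web crawls, so it suits that scenario well. But Hadoop is now treated as a general-purpose computing platform, and this has already...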
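Summary: This chapter mainly describes how to design an HBase schema. On this topic, the following presentation is strongly recommended; it is written very clearly. First, to emphasize once more: NoSQL cannot replace SQL, and for data that is not big data, SQL is without question easier to use. For any system or scenario, we should not fixate on replacing SQL with NoSQL; we should only move the part of the big data that SQL cannot handle (which is often weakly relational) onto NoSQL. ...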
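Summary: Seek vs. Transfer. I previously wrote a dedicated comparison of B+ trees and LSM trees: http://www.cnblogs.com/fxjwind/archive/2012/06/09/2543357.html The last blog post referenced there gives a good analysis of the essence of using B+ trees versus LSM trees (Log-Structured Merge-Trees): the balance between read and write efficiency, global ordering and local...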
Summary: Advanced Features Filters HBase filters are a powerful feature that can greatly enhance your effectiveness working with data stored in tables. You will find predefined filters, already provided by ...
Summary: General Notes The primary client interface to HBase is the HTable class in the org.apache.hadoop.hbase.client package. It provides the user with all the functionality needed to store and retrieve...
Summary: Before we start looking into all the moving parts of HBase, let us pause to think about why there was a need to come up with yet another storage architecture. Relational database management systems (RDBMS) have been around since the early 1970s, and have helped countless companies and organizations
Summary: Counters There are often things you would like to know about the data you are analyzing but that are peripheral to the analysis you are performing. For example, if you were counting invalid records a...
Summary: Anatomy of a MapReduce Job Run Classic MapReduce (MapReduce 1) A job run in classic MapReduce is illustrated in Figure 6-1. At the highest level, there are four independent entities: • The clien...
Summary: GenericOptionsParser, Tool, and ToolRunner Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired.
Summary: Data Integrity HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. T...
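The per-chunk checksum idea in this summary can be illustrated with a small sketch. This is not HDFS code: the function names are hypothetical, `zlib.crc32` stands in for whatever checksum HDFS actually uses, and the 512-byte chunk size is an assumed value for the `io.bytes.per.checksum` setting, since the summary is cut off before stating it.

```python
import zlib


def chunk_checksums(data: bytes, bytes_per_checksum: int = 512) -> list[int]:
    """Compute one CRC-32 per fixed-size chunk of data (the last chunk may be shorter).

    bytes_per_checksum plays the role of io.bytes.per.checksum; 512 is an assumption.
    """
    return [
        zlib.crc32(data[offset:offset + bytes_per_checksum])
        for offset in range(0, len(data), bytes_per_checksum)
    ]


def find_corrupt_chunks(
    data: bytes, stored: list[int], bytes_per_checksum: int = 512
) -> list[int]:
    """Return the indexes of chunks whose recomputed checksum differs from the stored one.

    Assumes data and stored describe the same number of chunks.
    """
    current = chunk_checksums(data, bytes_per_checksum)
    return [i for i, (now, then) in enumerate(zip(current, stored)) if now != then]


if __name__ == "__main__":
    original = bytes(range(256)) * 5           # 1280 bytes, so three chunks
    stored = chunk_checksums(original)         # checksums recorded at write time

    damaged = bytearray(original)
    damaged[600] ^= 0xFF                       # flip one byte inside the second chunk
    print(find_corrupt_chunks(bytes(damaged), stored))
```

Checksumming per chunk rather than per file means a reader can pinpoint which chunk is damaged instead of rejecting the whole file.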
Summary: The Hadoop Distributed Filesystem The Design of HDFS HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let...
Summary: http://www.oreillynet.com/pub/au/4685 The author of HBase: The Definitive Guide. HBase Architecture 101 - Storage http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html One of the most obscure things about HBase is that its data is...
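Summary: The Google File System http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.125.789&rep=rep1&type=pdf http://www.dbthink.com/?p=501 (Chinese translation). Google is full of brilliant engineers, yet when designing systems it is very pragmatic and does not adopt complicated or fashionable algorithms and mechanisms ...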
Summary: Bigtable: A Distributed Storage System for Structured Data http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf http://www.dbthink....
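Summary: Reposting a blog post because its diagrams are good and this makes it easy to find later: http://cloud.csdn.net/a/20120220/312061.html Apache Hadoop has become the driving force behind the growth of the big data industry. Technologies such as Hive and Pig are also mentioned often, but what do they actually do, and why do they need such odd names (such as Oozie, ZooKeeper, Flume)? Hadoop brought inexpensive processing of big data (the data volume of big data...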
Summary: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, including: Hadoop Core, our flagship sub-project, provides a distributed filesystem (HDFS) and su...
Summary: First, why do we need Hadoop? The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it. Faced with massive amounts of data, we need to analyze and store it efficiently, and Hadoop can do this. This, in a nutshell, is what Hadoop p...