October 2017 Archive

Structured Streaming StateStore Mechanism

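Summary: ref: https://jaceklaskowski.gitbooks.io/spark structured streaming/ Stateful processing in Structured Streaming is built on StateStore, which remembers historical results and so makes unbounded stream computation possible. Internally, it actually takes the historical …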

posted @ 2017-10-26 11:15 wlu Views (684) Comments (0) Recommended (0)
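
The idea of remembering historical results across batches can be sketched in plain Python. This is not Spark's StateStore API: the class `InMemoryStateStore` and the function `process_batch` are hypothetical names made up for illustration, and versioning, checkpointing, and fault tolerance are all left out.

```python
from collections import Counter


class InMemoryStateStore:
    """Hypothetical stand-in for a state store: one running value per key."""

    def __init__(self):
        self._state = {}

    def get(self, key, default=0):
        return self._state.get(key, default)

    def put(self, key, value):
        self._state[key] = value

    def snapshot(self):
        # Return a copy so callers cannot mutate the stored state.
        return dict(self._state)


def process_batch(store, words):
    """Merge one micro-batch of words into the running counts held in the store."""
    for word, n in Counter(words).items():
        store.put(word, store.get(word) + n)
    return store.snapshot()


store = InMemoryStateStore()
first = process_batch(store, ["spark", "stream", "spark"])
second = process_batch(store, ["stream", "state"])
print(first)
print(second)
```

The second batch only contains two words, yet its result still reports the count for "spark", because the store carried it over from the first batch.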

Spark Structured Stream 2

Summary: ❤Limitations of the DStream API. Batch time constraint: it is an application-level setting. No EventTime support: event time matters more than processing time. Weak support for Dataset/DataFrame. No cus …

posted @ 2017-10-25 16:06 wlu Views (1315) Comments (0) Recommended (0)

Spark 2 Structured Streaming

Summary: netcat (windows) nc -L -p 9999 Result: the window slides every 5 seconds and is 10 seconds wide. Aggregation keys: window, {world} http://asyncified.io/2017/07/30/exploring stateful streaming with spark str …

posted @ 2017-10-24 15:58 wlu Views (745) Comments (0) Recommended (0)
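
A window that is 10 seconds wide and slides every 5 seconds means each event falls into two overlapping windows. The following is a minimal sketch of that assignment in plain Python, not Spark's own `window` function. The helper names `windows_for` and `windowed_counts`, the half-open `[start, end)` convention, and the sample events are assumptions made for illustration.

```python
from collections import Counter


def windows_for(event_time, width=10, slide=5):
    """Return every [start, end) window, in seconds, that contains event_time."""
    # Latest window start that is <= event_time, aligned to the slide interval.
    start = (event_time // slide) * slide
    windows = []
    while start > event_time - width:
        windows.append((start, start + width))
        start -= slide
    return sorted(windows)


def windowed_counts(events, width=10, slide=5):
    """Count words per (window, word), the two aggregation keys."""
    counts = Counter()
    for event_time, word in events:
        for window in windows_for(event_time, width, slide):
            counts[(window, word)] += 1
    return counts


# Hypothetical events: (event time in seconds, word).
events = [(12, "hello"), (13, "world"), (17, "world")]
counts = windowed_counts(events)
for (window, word), n in sorted(counts.items()):
    print(window, word, n)
```

Because the windows overlap, the event at second 13 is counted in both the `(5, 15)` and the `(10, 20)` window.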

Fitting a Quadratic Function with a Neural Network

Summary: Uses the neural network code from NNDL to fit a quadratic equation with an ANN. ref: https://github.com/mnielsen/neural networks and deep learning Prepare the training data and train the network: a=[] f=[] for xi in np.array(xrange(0,100 …

posted @ 2017-10-20 13:36 wlu Views (2656) Comments (0) Recommended (0)

MLlib in Practice: Naive Bayes

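Summary: Introduction. This article walks through a complete text classification exercise using the pipeline provided by the Spark (1.5.0) ml library. The pipeline chains word splitting (tokenize), term-frequency counting (TF), feature vector computation (TF-IDF), Naive Bayes model training, and so on. This article uses the "20 NewsGroups" data …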

posted @ 2017-10-20 13:19 wlu Views (303) Comments (0) Recommended (0)
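
The first three stages named in the summary (tokenize, TF, TF-IDF) can be sketched in plain Python without Spark. This is only an illustration under stated assumptions: the function names and the two sample documents are hypothetical, terms are kept as dictionary keys rather than hashed into a fixed-size vector, the smoothed IDF formula log((N + 1) / (df + 1)) is an assumption chosen for the sketch, and the Naive Bayes training stage is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a document into lowercase words on whitespace."""
    return text.lower().split()


def term_frequencies(tokens):
    """Count how often each term occurs in one document."""
    return Counter(tokens)


def inverse_document_frequencies(tokenized_docs):
    """Smoothed IDF per term: log((N + 1) / (df + 1))."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # each document counts once per term
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def tf_idf(docs):
    """Return one {term: tf * idf} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    idf = inverse_document_frequencies(tokenized)
    return [
        {t: tf * idf[t] for t, tf in term_frequencies(tokens).items()}
        for tokens in tokenized
    ]


# Hypothetical two-document corpus.
features = tf_idf(["Spark streaming", "Spark ml"])
print(features)
```

A term that appears in every document, such as "spark" here, gets weight zero, while terms unique to one document keep a positive weight.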

Debezium for PostgreSQL to Kafka

Summary: In this article, we discuss the need to segregate the data models for reads and writes, and to use event sourcing to capture detailed data changes. These …

posted @ 2017-10-20 13:18 wlu Views (3798) Comments (0) Recommended (0)

Apache Geode with Spark

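Summary: In some scenarios, for example when a streaming RDD has to be joined with historical data to pick up profile information, the result is a join between a small RDD of new data and a very large historical RDD. A direct join in Spark is actually not efficient: RDDs have no index, so the join hashes both RDDs and shuffles them together; …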

posted @ 2017-10-20 13:13 wlu Views (537) Comments (1) Recommended (0)
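
The cost described in the summary, hashing both sides and shuffling them together, can be illustrated with a small plain-Python sketch. It is not Spark's implementation: `hash_partition`, `shuffle_join`, the partition count, and the sample records are hypothetical and only mimic the general shape of a hash-partitioned join.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Group (key, value) records into partitions by key hash (the 'shuffle')."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def shuffle_join(left, right, num_partitions=4):
    """Inner-join two lists of (key, value) pairs by co-partitioning both sides."""
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    joined = []
    for pid, left_records in left_parts.items():
        # Build a per-partition lookup table from the right side.
        lookup = defaultdict(list)
        for key, value in right_parts.get(pid, []):
            lookup[key].append(value)
        for key, left_value in left_records:
            for right_value in lookup[key]:
                joined.append((key, (left_value, right_value)))
    return joined


# Hypothetical data: a small batch of new events and a larger history.
new_events = [("u1", "click"), ("u3", "view")]
history = [("u1", "profile-a"), ("u2", "profile-b"), ("u3", "profile-c")]
result = sorted(shuffle_join(new_events, history))
print(result)
```

Note that every record of `history` is hashed and moved into a partition, even though only two of its keys are needed by the small side. That is the overhead that grows with the size of the historical data.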

Data and AI