Post Category - Spark Big Data

Abstract: val rddFromFile = spark.sparkContext.textFile("test.txt").collect().mkString("\n") Note: for a local file, either a relative or an absolute path works here, or you can pass an HDFS path directly. Take the first element of the Array[String]: val rddFr Read more
posted @ 2021-06-30 20:09 船长博客 Views (772) Comments (0) Recommendations (0)
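A minimal runnable sketch of the pattern in this abstract, assuming spark-shell (where spark and sc are predefined) and a local test.txt:

// A relative path, an absolute path, or an hdfs:// URI all work with textFile.
val rdd = spark.sparkContext.textFile("test.txt")

// collect() pulls every line to the driver; mkString joins them into one string.
val rddFromFile = rdd.collect().mkString("\n")

// First element of the collected Array[String].
val firstLine = rdd.collect()(0)
println(firstLine)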
Abstract: df.withColumn("Test", lit(null)).show()
+----+--------+-----+----+
|Hour|Category|Value|Test|
+----+--------+-----+----+
|   0|   cat26| 30.9|null|
|   1|   cat67| 28.5|null|
|   2|   cat56| 39.6| Read more
posted @ 2021-06-23 18:51 船长博客 Views (251) Comments (0) Recommendations (0)
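A short sketch of the null-column trick above, assuming an existing DataFrame df; casting the literal gives the new column a concrete type instead of NullType:

import org.apache.spark.sql.functions.lit

// Append a "Test" column that is null for every row.
val withTest = df.withColumn("Test", lit(null).cast("string"))
withTest.show()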
Abstract: List all tables: list. View all rows in a table: scan 'staff'. First 10 rows: scan 'test-table',{'LIMIT' => 10}. Last 10 rows: scan 'test-table',{'LIMIT' => 10, REVERSED => TRUE}. View a table's structure: desc 'staff Read more
posted @ 2021-06-04 18:16 船长博客 Views (207) Comments (0) Recommendations (0)
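The same limited and reversed scan can also be issued from code. A hedged sketch using the HBase 2.x Java client from Scala; the quorum address is a placeholder, and this client route is an assumption, since the post itself only shows the shell:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import scala.collection.JavaConverters._

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "localhost") // placeholder quorum

val conn = ConnectionFactory.createConnection(conf)
val table = conn.getTable(TableName.valueOf("test-table"))

// Equivalent of: scan 'test-table', {'LIMIT' => 10, REVERSED => TRUE}
val scan = new Scan().setLimit(10).setReversed(true)
val scanner = table.getScanner(scan)
scanner.asScala.foreach(println)

scanner.close(); table.close(); conn.close()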
Abstract: scan 'test-table',{'LIMIT' => 10, REVERSED => TRUE} Read more
posted @ 2021-06-04 18:03 船长博客 Views (615) Comments (0) Recommendations (0)
Abstract: scan 'test-table', {'LIMIT' => 10} Read more
posted @ 2021-06-04 17:43 船长博客 Views (5841) Comments (0) Recommendations (1)
Abstract: Example of creating a DataFrame: val df = sc.parallelize(Seq((0,"cat26","cat26"), (1,"cat67","cat26"), (2,"cat56","cat26"), (3,"cat8","cat26"))).toDF("Hour", " Read more
posted @ 2021-06-03 18:08 船长博客 Views (2052) Comments (0) Recommendations (0)
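A complete, runnable version of this kind of snippet for spark-shell; the column names after "Hour" are assumptions, since the abstract is cut off:

val df = sc.parallelize(Seq(
  (0, "cat26", "cat26"),
  (1, "cat67", "cat26"),
  (2, "cat56", "cat26"),
  (3, "cat8",  "cat26")
)).toDF("Hour", "Category", "Group") // names after "Hour" are assumed
df.show()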
Abstract: 1. Create a DataFrame: scala> val df = sc.parallelize(Seq((0,"cat26",30.9), (1,"cat67",28.5), (2,"cat56",39.6), (3,"cat8",35.6))).toDF("Hour", Read more
posted @ 2021-04-27 17:09 船长博客 Views (304) Comments (0) Recommendations (0)
Abstract: Construct a DataFrame: import org.apache.spark.sql._ import org.apache.spark.sql.types._ val data = Array(List("Category A", 100, "This is category A"), List(" Read more
posted @ 2021-04-13 19:48 船长博客 Views (1280) Comments (0) Recommendations (1)
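The abstract is cut off, but building a DataFrame from raw rows with the types._ import usually goes through Row and an explicit StructType. A sketch of that pattern; the column names, types, and second row are assumptions:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rows = Seq(
  Row("Category A", 100, "This is category A"),
  Row("Category B", 120, "This is category B") // illustrative second row
)

// Explicit schema; field names and types are guesses based on the sample row.
val schema = StructType(Seq(
  StructField("category", StringType, nullable = true),
  StructField("value", IntegerType, nullable = true),
  StructField("description", StringType, nullable = true)
))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.show(truncate = false)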
Abstract: .na.drop("all", Seq("create_time")) Read more
posted @ 2021-03-12 07:34 船长博客 Views (365) Comments (0) Recommendations (0)
Abstract: Save the following as small_zipcode.csv:
id,zipcode,type,city,state,population
1,704,STANDARD,,PR,30100
2,704,,PASEO COSTA DEL SUR,PR,
3,709,,BDA SAN LUIS,PR,3700
4, Read more
posted @ 2021-01-07 20:44 船长博客 Views (2434) Comments (0) Recommendations (1)
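A likely way to load this file; the read options below are assumptions, not taken from the truncated abstract:

// Header row becomes column names; empty CSV fields come back as null.
val zipDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("small_zipcode.csv")

zipDF.printSchema()
zipDF.show(truncate = false)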
Abstract: Drop rows where every column is null/NaN: df.na.drop("all"). Drop rows where any column is null/NaN: df.na.drop("any"). Example:
scala> df.show
+---+-------+-----+-----+-----+----------+
| id|zipcode| type| city|state|population|
+---+-------+-----+ Read more
posted @ 2021-01-07 20:39 船长博客 Views (2202) Comments (0) Recommendations (1)
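A self-contained sketch of the two drop modes for spark-shell; the rows are abbreviated stand-ins for the post's zipcode table:

val df = Seq(
  (Some(1), Some("704"), Some("STANDARD")),
  (Some(2), Some("704"), None),
  (None, None, None)
).toDF("id", "zipcode", "type")

df.na.drop("all").show() // removes only the row where every column is null
df.na.drop("any").show() // removes every row containing at least one null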
Abstract: scala> val a = Seq(("a", 2),("b",3)).toDF("name","score") a: org.apache.spark.sql.DataFrame = [name: string, score: int] scala> a.show()
+----+-----+
|name|s Read more
posted @ 2021-01-07 13:53 船长博客 Views (1851) Comments (0) Recommendations (1)
Abstract: library(datasets) summary(iris)
##  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.1 Read more
posted @ 2020-12-31 13:43 船长博客 Views (2486) Comments (0) Recommendations (0)
Abstract: val goalsDF = Seq(("messi", 2), ("messi", 1), ("pele", 3), ("pele", 1)).toDF("name", "goals") goalsDF.show()
+-----+-----+
| name|goals|
+-----+-----+
|messi|    2|
|m Read more
posted @ 2020-12-30 18:12 船长博客 Views (1418) Comments (0) Recommendations (2)
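The abstract cuts off before the operation; a common one on data like this is the maximum goals per player. A sketch under that assumption (the post may well use a different approach):

import org.apache.spark.sql.functions.max

val goalsDF = Seq(("messi", 2), ("messi", 1), ("pele", 3), ("pele", 1))
  .toDF("name", "goals")

// Maximum goals per name
goalsDF.groupBy("name").agg(max("goals").as("max_goals")).show()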
Abstract: val goalsDF = Seq(("messi", 2), ("messi", 1), ("pele", 3), ("pele", 1)).toDF("name", "goals") goalsDF.show()
+-----+-----+
| name|goals|
+-----+-----+
|messi|    2|
|m Read more
posted @ 2020-12-30 18:03 船长博客 Views (1026) Comments (0) Recommendations (0)
Abstract: import org.apache.spark.sql.functions.{row_number, max, broadcast} import org.apache.spark.sql.expressions.Window val df = sc.parallelize(Seq( (0,"cat Read more
posted @ 2020-12-30 11:32 船长博客 Views (545) Comments (0) Recommendations (0)
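These imports are the standard setup for keeping the best row per group with a window function. A minimal sketch of that pattern for spark-shell; the sample rows are illustrative, since the abstract stops at "(0,"cat":

import org.apache.spark.sql.functions.{row_number, desc}
import org.apache.spark.sql.expressions.Window

val df = Seq((0, "cat26", 30.9), (0, "cat13", 22.1), (1, "cat67", 28.5), (1, "cat4", 26.8))
  .toDF("Hour", "Category", "Value")

// Rank rows within each Hour by descending Value, then keep only the top one.
val w = Window.partitionBy("Hour").orderBy(desc("Value"))
df.withColumn("rn", row_number().over(w))
  .where("rn = 1")
  .drop("rn")
  .show()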
Abstract: scala> val df = sc.parallelize(Seq( (0,"cat26",30.9), (1,"cat67",28.5), (2,"cat56",39.6), (3,"cat8",35.6))).toDF("Hour", "Category", "Value") scala> d Read more
posted @ 2020-12-29 20:22 船长博客 Views (339) Comments (0) Recommendations (0)
Abstract: scala> val df = sc.parallelize(Seq((0,"cat26",30.9), (1,"cat67",28.5), (2,"cat56",39.6), (3,"cat8",35.6))).toDF("Hour", "Category", "Value") Read more
posted @ 2020-12-29 20:20 船长博客 Views (1194) Comments (0) Recommendations (1)
Abstract: val df = sc.parallelize(Seq( (0,"cat26",30.9), (1,"cat67",28.5), (2,"cat56",39.6), (3,"cat8",35.6))).toDF("Hour", "Category", "Value") // or read the rows from a file into a List Read more
posted @ 2020-12-29 20:14 船长博客 Views (2115) Comments (0) Recommendations (1)
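The trailing comment points at building the same data from a file. A sketch of that route for spark-shell; the file name "hours.csv" and the line format are assumptions:

import scala.io.Source

// Read lines like "0,cat26,30.9" into a List[String]; the file name is hypothetical.
val lines = Source.fromFile("hours.csv").getLines().toList

// Parse each line into a tuple, then parallelize into the same DataFrame shape.
val parsed = lines.map { line =>
  val Array(h, c, v) = line.split(",")
  (h.toInt, c, v.toDouble)
}

val df = sc.parallelize(parsed).toDF("Hour", "Category", "Value")
df.show()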

Always believe that something wonderful is about to happen!