Ways to read files in Spark

1. Call the hadoopFile method to read TXT files; this suits complex delimiters such as |+| or ;

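import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import spark.implicits._ // needed for .toDF; assumes a SparkSession named spark

val gbkPath = s"/bdtj/line/DD_OUT_NOW_LV_$month.txt" // file path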

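// Pass gbkPath in as the path argument and read the file
val Company2_temp = spark.sparkContext.hadoopFile(gbkPath, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 1)
  // Decode only the first getLength bytes; the charset must match the file's actual encoding (for example "GBK" for a GBK file)
  .map(p => new String(p._2.getBytes, 0, p._2.getLength, "UTF-8"))
val str: String = Company2_temp.first() // the header, i.e. the first line of the file
val Company2 = Company2_temp.filter(!_.equals(str)) // drop the header line (and any other line identical to it)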
  .map(lines => {
    // Split on the literal |+| (limit -1 keeps trailing empty fields),
    // then strip the first and last character of every field
    val line: Array[String] = lines.split("\\|\\+\\|", -1).map(_.drop(1).dropRight(1))
    (line(0), line(1), line(2), line(3), line(4), line(5), line(6), line(7),
      line(8), line(9), line(10), line(11), line(12), line(13), line(14), line(15),
      line(16), line(17), line(18), line(19), line(20), line(21))
  })
  .toDF("CITY_NAME", "W_NUM", "L_NUM", "N_NUM", "NR21_NUM", "NR35_NUM",
    "W_NOW_NUM", "L_NOW_NUM", "N_NOW_NUM", "NR21_NOW_NUM", "NR35_NOW_NUM",
    "L_LV", "NR21_LV", "NR35_LV", "E_LV", "L_BMD_NUM", "E_BMD_NUM",
    "L_LV_MONTH", "NR21_LV_MONTH", "NR35_LV_MONTH", "E_LV_MONTH", "DAY_NUM")
Company2.show(100, false)
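
The splitting step can be tried without a cluster. The following is a minimal sketch in plain Scala, not the code from the job above: the object name SplitDemo, the helper names, and the sample row are made up for illustration, and the sample assumes each non-empty field is wrapped in double quotes. It uses Pattern.quote as an alternative to escaping |+| by hand, and shows why the -1 limit matters when a row ends with empty fields.

```scala
import java.util.regex.Pattern

object SplitDemo {
  // Hypothetical sample row: two quoted fields followed by two empty trailing fields
  val sample: String = "\"Tianjin\"|+|\"12\"|+||+|"

  // Pattern.quote turns |+| into a literal pattern, equivalent to "\\|\\+\\|"
  val delimiter: String = Pattern.quote("|+|")

  // Strip one wrapping character from each end, like drop(1).dropRight(1) in the post
  def unwrap(field: String): String = field.drop(1).dropRight(1)

  // Limit -1 keeps trailing empty fields, so every row yields the same number of columns
  def parse(row: String): Array[String] = row.split(delimiter, -1).map(unwrap)

  def main(args: Array[String]): Unit = {
    println(parse(sample).length)           // field count with limit -1
    println(sample.split(delimiter).length) // field count without a limit
  }
}
```

Without the limit, String.split drops trailing empty strings, so indexing a fixed column position such as line(21) could fail on rows whose last fields are empty.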

2. Call the read method to read CSV files; these files have a simpler format and are mostly comma-delimited
import org.apache.spark.sql.DataFrame

val keda_3g: DataFrame = spark
  .read
  .format("csv")
  .option("header", "true") // treat the first line as the header, not as data
  .option("encoding", "gbk") // file encoding
  .option("inferSchema", "true") // infer column types automatically
  .option("delimiter", ",") // field delimiter; the default is ,
  .load(s"/bdtj/3G_Traffic/3gwy/keda_3g_$yesterday.csv")
  // .load(s"E:\\IDEA\\TianJin-ChinaUnicom\\data_1\\keda_3g_$yesterday.csv")
  .toDF("time", "Location_area_code", "Cell_id", "manufactor",
    "Community_name", "RLC_UP", "RLC_DOWN", "CS", "Send")



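/**
 * Option values can be given as strings or as concrete types such as Boolean.
 * delimiter: field delimiter; the default is a comma
 * nullValue: the string that represents a null value; if no string should be treated as null, assign null
 * quote: quote character; the default is the double quote "
 * header: treat the first line as column names rather than data
 * inferSchema: infer column types automatically
 * ignoreLeadingWhiteSpace: trim leading whitespace from values
 * ignoreTrailingWhiteSpace: trim trailing whitespace from values
 * multiLine: allow one record to span multiple lines, for example a quoted field that contains line breaks
 * encoding: character encoding, for example gbk, utf-8, Unicode, GB2312
 */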
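
A few of these options can be tried on in-memory data. The sketch below is illustrative only: the object name CsvOptionsDemo, the load helper, and the sample rows are hypothetical, and it assumes Spark 2.2 or later with a local master, where DataFrameReader.csv accepts a Dataset[String].

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvOptionsDemo {
  // Parse a few hypothetical in-memory lines with the options described above
  def load(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Made-up rows: a header, a value with leading spaces, and "NA" as a null marker
    val lines = Seq("name;count", "  a;1", "b;NA").toDS()
    spark.read
      .option("header", "true")                  // first line supplies the column names
      .option("delimiter", ";")                  // non-default field delimiter
      .option("nullValue", "NA")                 // read the string NA as null
      .option("ignoreLeadingWhiteSpace", "true") // trim leading spaces from values
      .option("inferSchema", "true")             // infer column types from the data
      .csv(lines)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for experimentation
      .appName("csv-options-demo")
      .getOrCreate()
    val df = load(spark)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

Printing the schema and the rows makes it easy to see the effect of each option by removing it and running again.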
posted @ 2022-11-02 09:27  tonggang_bigdata