[Spark] Exception when a DataFrame reads a JSON file: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed...


When running a Scala program in IDEA that executes Spark SQL and calls:

df.show()
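For context, here is a minimal sketch of the kind of program involved (the object name Student matches the stack trace below; the path student.json and the SparkSession settings are assumptions for illustration):

import org.apache.spark.sql.SparkSession

object Student {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for running inside IDEA (assumed configuration)
    val spark = SparkSession.builder()
      .appName("Student")
      .master("local[*]")
      .getOrCreate()

    // Read the JSON file with the default (one-record-per-line) parser;
    // df.show() is where the AnalysisException below is thrown
    val df = spark.read.json("student.json") // placeholder path
    df.show()

    spark.stop()
  }
}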


the following error is reported:

19/12/06 14:26:17 INFO SparkContext: Created broadcast 2 from show at Student.scala:16
Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
    at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:120)
    at org.apache.spark.sql.execution.datasources.FileFormat$class.buildReaderWithPartitionValues(FileFormat.scala:129)
    at org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:165)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:309)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:305)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:327)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:751)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:710)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:719)
    at Student$.main(Student.scala:16)
    at Student.main(Student.scala)


The cause is that my JSON records were formatted across multiple lines. Spark's default JSON source reads JSON Lines input, i.e. it expects one complete JSON object per line; every line of a pretty-printed record fails to parse on its own, so the inferred schema contains only the internal _corrupt_record column, and Spark 2.3+ refuses to query that column directly. The fix is simply to put each record on a single line. My file looked like this:

{
  "name": "Michael",
  "age": 12
}
{
  "name": "Andy",
  "age": 13
}
{
  "name": "Justin",
  "age": 8
}


Change it to:

{"name": "Michael",  "age": 12}
{"name": "Andy",  "age": 13}
{"name": "Justin",  "age": 8}

