pyspark：'PipelinedRDD' object does not support indexing、 Initial job has not accepted any resources、IOException not a file: hdfs:// XXXX java.sql、Failed to replace a bad datanode on the existing

最近使用Pyspark的时候，遇到一些新的问题，希望记录下来，解决的我会补充。

1. WARN DomainSocketFactory: The short-circuit local reads feature cannot be used

2. pyspark TypeError: 'PipelinedRDD' object does not support indexing

该格式的RDD不能直接索引，但是可以通过其他方式实现：

方法一：使用take之后，再索引 —— some_rdd.take(10)[5] ：即表示取前10个中的索引为5的元素；

方法二：如果数据量较少，可以先 collect —— some_rdd.collect() 转化为array格式的数据，再索引；

方法三：通多lambda函数和map函数可以实现 —— some_rdd.map(lambda x: x)

3.WARN DFSClient: Failed to connect to /ip:port for block, add to deadNodes

据说是防火墙原因，但是本人尚未尝试。

4. WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

本人使用过：spark-submit --executor-memory 512M --total-executor-cores 2 test.py

但是这个方法没有解决这个问题，还在查找中。

原因：可能是内存不足造成的，可以用 free -m 查看一下节点的内存使用情况。

解决方法：

可以尝试方法一：在spark-env.sh中添加环境变量 —— export SPARK_EXECUTOR_MEMORY=512m

然后重启之后再执行。

可以尝试方法二：先清理内存，再执行，即依次执行以下三条命令：

sync    #写缓存到文件系统
echo 3 > /proc/sys/vm/drop_caches   #手动释放内存

# 其中：
# 0：不释放（系统默认值）
# 1：释放页缓存
# 2：释放dentries和inodes
# 3：释放所有缓存，即清除页面缓存、目录项和节点；

free -h     #查看是否已经清理

# 注：指定内存和核，--executor-memory 需要大于450MB, 也就是471859200B

5. java.io.IOException not a file: hdfs:// XXXX java.sql.SQLException

解决方法：在spark-sql命令行中，设置参数，即执行：

SET mapred.input.dir.recursive=true;

SET hive.mapred.supports.subdirectories=true;

原因：猜测是因为要读取的文件或者表在子目录导致。

6. java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being availableto try 在pyspark+kafka+sparkstreaming 测试时报错

解决方法：

方法一：修改hdfs-site.xml

修改hdfs-site.xml文件，添加或修改如下两项：

<name>dfs.client.block.write.replace-datanode-on-failure.enable</name>

</property>

<name>dfs.client.block.write.replace-datanode-on-failure.policy</name>

<value>NEVER</value>

</property>

方法二：修改程序，加入配置

import os
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext
from pyspark.storagelevel import StorageLevel
from pyspark.sql.types import StructField, StructType, StringType

from pyspark.streaming import StreamingContext   #sparkstreaming
from pyspark.streaming.kafka import KafkaUtils

import warnings
warnings.filterwarnings("ignore")

# cluster模式
conf = SparkConf().setAppName('test')
conf.set(" "," ")    # 各种spark配置
conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER")
conf.set("dfs.client.block.write.replace-datanode-on-failure.enable", True)
# ......
sc = SparkContext(conf=conf)
......

参考：

https://blog.csdn.net/xwc35047/article/details/53933265

https://jingyan.baidu.com/article/375c8e1971d00864f3a22902.html

https://blog.csdn.net/Gavinmiaoc/article/details/80527717

http://www.bubuko.com/infodetail-1789319.html

https://blog.csdn.net/caiandyong/article/details/44730031

posted on 2020-03-28 19:09 落日峡谷阅读(1255) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

落日峡谷

pyspark：'PipelinedRDD' object does not support indexing、 Initial job has not accepted any resources、IOException not a file: hdfs:// XXXX java.sql、Failed to replace a bad datanode on the existing

公告

导航