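Error background
Linux environment, regular (non-root) user.
A Flink job was submitted to a YARN cluster. YARN allocated resources successfully, but the job stayed in the ACCEPTED state and never ran. The YARN logs contained almost no error messages; after waiting for a while the job simply exited without reporting any obvious error.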
Error symptoms
The YARN web UI at http://bigdata1:8088/cluster/app/application_xxx shows the following:
Application Attempt State: FAILED
Started: Fri Oct 14 09:28:16 +0800 2022
Elapsed: 3sec
AM Container: container_1665709915900_0003_02_000001
Node: N/A
Tracking URL: History
Diagnostics Info: AM Container for appattempt_1665709915900_0003_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2022-10-14 09:28:19.937]Exception from container-launch.
Container id: container_1665709915900_0003_02_000001
Exit code: 1
[2022-10-14 09:28:19.939]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
[2022-10-14 09:28:19.940]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
For more detailed output, check the application tracking page: http://bigdata:8088/cluster/app/application_1665709915900_0003 Then click on links to logs of each attempt.
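The state and diagnostics shown on that page can also be read from the ResourceManager REST API (the Cluster Application endpoint under /ws/v1/cluster/apps), which is handy when the web UI is not reachable from your workstation. The following is only a minimal sketch: the helper names are made up for illustration, and the host and application ID in the usage comment are the ones from this post, so replace them with your own.

```python
import json
from urllib.request import Request, urlopen


def app_url(rm_address: str, app_id: str) -> str:
    """Build the ResourceManager REST URL for a single application."""
    return f"http://{rm_address}/ws/v1/cluster/apps/{app_id}"


def summarize_app(payload: dict) -> dict:
    """Keep only the fields that matter when a job hangs or fails early."""
    app = payload.get("app", {})
    return {
        "state": app.get("state"),
        "finalStatus": app.get("finalStatus"),
        "diagnostics": (app.get("diagnostics") or "").strip(),
    }


def fetch_app_summary(rm_address: str, app_id: str, timeout: float = 10.0) -> dict:
    """Query the ResourceManager and return the summarized application report."""
    request = Request(app_url(rm_address, app_id),
                      headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return summarize_app(json.load(response))


# Usage (requires network access to the ResourceManager):
#   summary = fetch_app_summary("bigdata1:8088", "application_1665709915900_0003")
#   for key, value in summary.items():
#       print(f"{key}: {value}")
```

If the diagnostics field is as empty as prelaunch.err was here, the REST call tells you nothing new, but it does confirm that the attempt failed before the ApplicationMaster produced any output of its own.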
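Cause
The cause was environment variables.
In earlier test environments permissions were wide open, and the deployed Hadoop started and ran without problems. This time a regular user was used, and its permissions are tightly restricted. My guess is that these restrictions prevent Hadoop from finding the relevant environment variables when it makes internal calls at runtime.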
Solution
Add the following classpath property to yarn-site.xml:
<property>
  <name>yarn.application.classpath</name>
  <value>/opt/app/hadoop/etc/hadoop:/opt/app/hadoop/share/hadoop/common/lib/*:/opt/app/hadoop/share/hadoop/common/*:/opt/app/hadoop/share/hadoop/hdfs:/opt/app/hadoop/share/hadoop/hdfs/lib/*:/opt/app/hadoop/share/hadoop/hdfs/*:/opt/app/hadoop/share/hadoop/yarn/lib/*:/opt/app/hadoop/share/hadoop/yarn/*:/opt/app/hadoop/share/hadoop/mapreduce/lib/*:/opt/app/hadoop/share/hadoop/mapreduce/*</value>
</property>
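The value above is written for a Hadoop install under /opt/app/hadoop. If yours lives elsewhere, the same list can be generated and sanity-checked before you paste it into yarn-site.xml. This is a rough sketch, not an official tool: build_classpath and missing_entries are hypothetical helpers, and the sub-path list simply mirrors the entries used in this post. Running `hadoop classpath` on the node also prints the classpath that installation expects, which is worth comparing against.

```python
from pathlib import Path
from typing import List

# Sub-paths relative to the Hadoop install directory, in the order used in this post.
CLASSPATH_SUFFIXES = [
    "etc/hadoop",
    "share/hadoop/common/lib/*",
    "share/hadoop/common/*",
    "share/hadoop/hdfs",
    "share/hadoop/hdfs/lib/*",
    "share/hadoop/hdfs/*",
    "share/hadoop/yarn/lib/*",
    "share/hadoop/yarn/*",
    "share/hadoop/mapreduce/lib/*",
    "share/hadoop/mapreduce/*",
]


def build_classpath(hadoop_home: str) -> str:
    """Join the sub-paths onto hadoop_home with ':' as the separator."""
    home = hadoop_home.rstrip("/")
    return ":".join(f"{home}/{suffix}" for suffix in CLASSPATH_SUFFIXES)


def missing_entries(classpath: str) -> List[str]:
    """Return the entries whose directory does not exist on this machine."""
    missing = []
    for entry in classpath.split(":"):
        # A trailing "/*" means "all jars in this directory"; check the directory itself.
        directory = entry[:-2] if entry.endswith("/*") else entry
        if not Path(directory).is_dir():
            missing.append(entry)
    return missing


# Usage: run as the same restricted user that starts the YARN daemons.
#   value = build_classpath("/opt/app/hadoop")
#   print(value)
#   print("missing:", missing_entries(value))
```

Only existence is checked here. A directory that exists but is not readable by the restricted user would still pass, so an empty result does not prove the permissions are right.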
Then restart Hadoop, and the job runs normally.