Common Hadoop Exceptions and Their Solutions
1. Shell$ExitCodeException
Symptom: the following exception appears when running a Hadoop job:
14/07/09 14:42:50 INFO mapreduce.Job: Task Id : attempt_1404886826875_0007_m_000000_1, Status : FAILED
Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Container exited with a non-zero exit code 1
Cause and solution: the root cause is unknown; restarting the cluster restores normal operation.
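Before restarting, it may be worth pulling the failed container's logs to find the command that actually exited with code 1. A minimal diagnostic sketch, assuming YARN log aggregation is enabled (the application ID is derived from the attempt ID in the log above):
# fetch the aggregated logs for the failed application
yarn logs -applicationId application_1404886826875_0007
# or inspect the NodeManager's container logs directly on the node
ls $HADOOP_HOME/logs/userlogs/application_1404886826875_0007/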
2. libhadoop.so.1.0.0 which might have disabled stack guard
Symptom: Hadoop 2.2.0 - warning: You have loaded library /home/hadoop/2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard.
Cause and solution: add the following to /etc/profile:
# hadoop configuration
export PATH=$PATH:/home/jediael/hadoop-2.4.1/bin:/home/jediael/hadoop-2.4.1/sbin
export HADOOP_HOME=/home/jediael/hadoop-2.4.1
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
This warning appears when the last two entries (HADOOP_COMMON_LIB_NATIVE_DIR and HADOOP_OPTS) are missing.
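To confirm the fix took effect, a quick sanity check (a sketch; hadoop checknative ships with Hadoop 2.x) after sourcing the profile:
source /etc/profile
hadoop checknative -a   # libhadoop should be reported as loaded from lib/native
hadoop fs -ls /         # the stack-guard warning should no longer appear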
3. Retrying connect to server: master166/10.252.48.166:9000. Already tried 0 time(s)
Symptom: the following error appears when running HDFS commands on a datanode:
[jediael@slave156 ~]$ hadoop fs -ls /
14/08/31 15:00:37 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/08/31 15:00:38 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/08/31 15:00:39 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/08/31 15:00:40 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/08/31 15:00:41 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/08/31 15:00:42 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/08/31 15:00:43 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/08/31 15:00:44 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/08/31 15:00:45 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/08/31 15:00:46 INFO ipc.Client: Retrying connect to server: master166/10.252.48.166:9000. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
ls: Call to master166/10.252.48.166:9000 failed on connection exception: java.net.ConnectException: Connection refused
This error usually means the datanode cannot connect to the namenode. One common cause:
/etc/hosts contains a 127.0.0.1 ***** entry, such as
127.0.0.1 localhost
Remove these entries, reformat the namenode, and restart the Hadoop processes to fix it.
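A quick way to confirm this diagnosis before reformatting (a sketch; hostname and port taken from the log above):
# on the namenode host, master166 must NOT resolve to the loopback address
grep master166 /etc/hosts
# the namenode should be listening on its LAN IP (or 0.0.0.0), not 127.0.0.1
netstat -tlnp | grep 9000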
Another possible cause:
After installation, Hadoop must be formatted with hadoop namenode -format before it can be used. If the machine is rebooted and Hadoop started again, hadoop fs -ls keeps failing with 10/09/25 18:35:29 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s)., and jps shows no NameNode process; only another hadoop namenode -format makes the cluster usable again.
Cause: by default Hadoop places some temporary files under /tmp, and /tmp is cleared when the system reboots, hence the error.
Solution: add the following to conf/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/log/hadoop/tmp</value>
  <description>A base for other temporary directories</description>
</property>
Then reformat the namenode and restart Hadoop.
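A minimal command sequence for this fix (a sketch, assuming the hadoop.tmp.dir value above):
mkdir -p /var/log/hadoop/tmp   # the new tmp dir must exist and be writable
stop-all.sh
hadoop namenode -format        # note: this wipes existing HDFS metadata
start-all.sh
jps                            # NameNode should now appear in the process list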
4. Permission denied: user=liaoliuqing, access=WRITE, inode="":jediael:supergroup:rwxr-xr-x
Cause: the user has insufficient permissions and cannot write to the file in HDFS.
Solution:
Disable HDFS permission checking by adding the following to hdfs-site.xml:
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
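Note that disabling permission checking opens the whole filesystem to every user. A narrower alternative (a sketch; the paths are hypothetical) is to grant the specific user access instead:
# give the user a home directory they own (run as the HDFS superuser)
hadoop fs -mkdir /user/liaoliuqing
hadoop fs -chown liaoliuqing /user/liaoliuqing
# or open up only the directory being written to
hadoop fs -chmod 775 /path/being/written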
5. Incompatible namespaceIDs
2015-02-02 15:10:57,526 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-02-02 15:10:57,543 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2015-02-02 15:10:57,543 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2015-02-02 15:10:57,544 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2015-02-02 15:10:57,699 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2015-02-02 15:10:58,090 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /mnt/tmphadoop/dfs/data: namenode namespaceID = 2017454015; datanode namespaceID = 1238467850
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:414)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:321)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1712)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1651)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1669)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1795)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1812)
Cause:
Each namenode format creates a new namespaceID, while ${hadoop.tmp.dir}/dfs/data still holds the ID from the previous format. Re-running the format clears the namenode's data but not the datanodes', so the namespaceID on the namenode no longer matches the one on the datanodes; the exception above is thrown and the datanode fails to start.
Solution:
(1) Stop Hadoop:
stop-all.sh
(2) On every slave, delete the contents of dfs.data.dir. If this property has not been changed, its default is:
<property>
  <name>dfs.data.dir</name>
  <value>${hadoop.tmp.dir}/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks. If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>
(3) Reformat the namenode:
hadoop namenode -format
Then start Hadoop with start-all.sh.
This solution deletes the existing data. If the data cannot be deleted, use one of the following instead:
(1) Edit the ${dfs.data.dir}/current/VERSION file on each datanode and change its namespaceID to match the namenode's (see the sketch after this list).
(2) Modify ${dfs.data.dir}
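A sketch of option (1), assuming the data directory from the log above and that the namenode metadata lives in the sibling dfs/name directory:
# read the namenode's namespaceID (2017454015 in the log above)
cat /mnt/tmphadoop/dfs/name/current/VERSION
# edit the datanode's VERSION file and set namespaceID=2017454015
vi /mnt/tmphadoop/dfs/data/current/VERSION
# restart the datanode
hadoop-daemon.sh start datanode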