A YARN RM crash: a case study
Today I received an alert from the production ResourceManager.
The error log was as follows:
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxxx:53356 Timed out after 600 secs
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node xxxx:53356 as it is now LOST
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: xxxx:53356 Node Transitioned from UNHEALTHY to LOST
2014-07-08 13:22:54,118 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:715)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:974)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:108)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:378)
    at java.lang.Thread.run(Thread.java:662)
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000

This is a known bug, tracked as YARN-502: https://issues.apache.org/jira/browse/YARN-502
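According to the bug description, the crash can be triggered when the RM removes an NM that is marked UNHEALTHY: the node has already been removed once, so the second removal attempt fails.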
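The failure mode described above can be reproduced in miniature. The following is a minimal sketch under stated assumptions, not the real FairScheduler: the class `DoubleRemovalSketch`, its fields, and both removal methods are hypothetical names made up for illustration. It only shows why removing an already-removed entry can end in a NullPointerException, and how a null check turns the repeat removal into a no-op.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a scheduler's node bookkeeping.
// It is NOT the real FairScheduler; it only mimics the failure mode.
class DoubleRemovalSketch {
    // Memory per tracked node, keyed by "host:port".
    private final Map<String, Integer> nodeMemoryMb = new HashMap<>();
    private int clusterMemoryMb = 0;

    void addNode(String nodeId, int memoryMb) {
        nodeMemoryMb.put(nodeId, memoryMb);
        clusterMemoryMb += memoryMb;
    }

    // Unsafe removal: assumes the node is still tracked.
    void removeNodeUnsafe(String nodeId) {
        Integer memoryMb = nodeMemoryMb.remove(nodeId); // null if the node is already gone
        clusterMemoryMb -= memoryMb; // unboxing null throws NullPointerException
    }

    // Defensive removal: a repeated removal is reported and ignored.
    boolean removeNodeSafe(String nodeId) {
        Integer memoryMb = nodeMemoryMb.remove(nodeId);
        if (memoryMb == null) {
            return false; // node was already removed
        }
        clusterMemoryMb -= memoryMb;
        return true;
    }

    int clusterMemoryMb() {
        return clusterMemoryMb;
    }
}
```

Calling `removeNodeUnsafe` twice for the same node throws on the second call, while `removeNodeSafe` simply returns `false`. Where the guard lives is a design choice; this sketch puts it in the removal method purely for illustration.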
Following the stack trace into the code:
In ResourceManager$SchedulerEventDispatcher (the scheduler field is of type org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler):

protected ResourceScheduler scheduler;

private final class EventProcessor implements Runnable { // a dedicated EventProcessor thread handles the events
  @Override
  public void run() {
    SchedulerEvent event;
    while (!stopped && !Thread.currentThread().isInterrupted()) {
      try {
        event = eventQueue.take(); // take the next event off the event queue
      } catch (InterruptedException e) {
        LOG.error("Returning, interrupted : " + e);
        return; // TODO: Kill RM.
      }
      try {
        scheduler.handle(event); // handle the event
      } catch (Throwable t) { // catch anything thrown while handling the event
        // An error occurred, but we are shutting down anyway.
        // If it was an InterruptedException, the very act of
        // shutdown could have caused it and is probably harmless.
        if (stopped) {
          LOG.warn("Exception during shutdown: ", t);
          break;
        }
        LOG.fatal("Error in handling event type " + event.getType() // per the log, event.getType() here is NODE_REMOVED
            + " to the scheduler", t);
        if (shouldExitOnError
            && !ShutdownHookManager.get().isShutdownInProgress()) {
          LOG.info("Exiting, bbye..");
          System.exit(-1);
        }
      }
    }
  }
}

This shows that shouldExitOnError controls whether the RM process exits.
private boolean shouldExitOnError = false; // initially false

@Override
public synchronized void init(Configuration conf) { // at initialization the value is read from the configuration
  this.shouldExitOnError =
      conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
          Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR); // both constants are defined in Dispatcher
  super.init(conf);
}

The org.apache.hadoop.yarn.event.Dispatcher interface:

public interface Dispatcher {

  // Configuration to make sure dispatcher crashes but doesn't do system-exit in
  // case of errors. By default, it should be false, so that tests are not
  // affected. For all daemons it should be explicitly set to true so that
  // daemons can crash instead of hanging around.
  public static final String DISPATCHER_EXIT_ON_ERROR_KEY =
      "yarn.dispatcher.exit-on-error"; // the controlling parameter

  public static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false; // defaults to false

  EventHandler getEventHandler();

  void register(Class<? extends Enum> eventType, EventHandler handler);
}

In the init method of the ResourceManager class:
@Override
public synchronized void init(Configuration conf) {
  this.conf = conf;
  this.conf.setBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY, true); // sets the value to true, overriding the DEFAULT in Dispatcher
  ...

So for the RM the effective default is true: when the dispatcher hits an error, the process exits.
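Whether to exit on error is governed by the configuration parameter yarn.dispatcher.exit-on-error. Changing it has a fairly broad impact, though, so it is better to leave it alone and fix the problem with a patch instead.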
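To make the effect of the flag concrete, here is a minimal sketch of an event loop with an exit-on-error switch. It is an illustration under stated assumptions, not the Hadoop implementation: `ExitOnErrorSketch`, the string event types, and the `exitAction` callback (which stands in for `System.exit(-1)` so the behavior can be observed) are all hypothetical, and the loop runs synchronously instead of on its own thread.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical, simplified event loop; not the real SchedulerEventDispatcher.
class ExitOnErrorSketch {
    private final Queue<String> eventQueue = new ArrayDeque<>();
    private final Consumer<String> handler;
    private final boolean shouldExitOnError;
    private final Runnable exitAction; // stands in for System.exit(-1)
    private int handledCount = 0;
    private boolean exited = false;

    ExitOnErrorSketch(Consumer<String> handler, boolean shouldExitOnError, Runnable exitAction) {
        this.handler = handler;
        this.shouldExitOnError = shouldExitOnError;
        this.exitAction = exitAction;
    }

    void enqueue(String eventType) {
        eventQueue.add(eventType);
    }

    // Processes queued events until the queue is empty or the exit action fires.
    void drain() {
        while (!exited && !eventQueue.isEmpty()) {
            String eventType = eventQueue.poll();
            try {
                handler.accept(eventType);
                handledCount++;
            } catch (Throwable t) {
                System.err.println("Error in handling event type " + eventType + ": " + t);
                if (shouldExitOnError) {
                    exited = true;
                    exitAction.run();
                }
                // With the flag off, the loop logs the error and moves on.
            }
        }
    }

    int handledCount() {
        return handledCount;
    }

    boolean exited() {
        return exited;
    }
}
```

With the flag on, a single failing event stops the loop and everything still queued is left unprocessed, which matches the crash seen in the log. With the flag off, the loop survives the failure, but whatever state the handler left half-updated stays that way, which is one reason turning the flag off is not an attractive fix.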
The official patch is fairly simple: it adds a check when the NM is removed, so that the removal is not performed a second time:
--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
@@ -501,8 +501,13 @@ public DeactivateNodeTransition(NodeState finalState) {
     public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
       // Inform the scheduler
       rmNode.nodeUpdateQueue.clear();
-      rmNode.context.getDispatcher().getEventHandler().handle(
-          new NodeRemovedSchedulerEvent(rmNode));
+      // If the current state is NodeState.UNHEALTHY
+      // Then node is already been removed from the
+      // Scheduler
+      if (!rmNode.getState().equals(NodeState.UNHEALTHY)) {
+        rmNode.context.getDispatcher().getEventHandler()
+          .handle( new NodeRemovedSchedulerEvent(rmNode));
+      }
       rmNode.context.getDispatcher().getEventHandler().handle(
           new NodesListManagerEvent(
               NodesListManagerEventType.NODE_UNUSABLE, rmNode));

Reposted from: https://blog.51cto.com/caiguangguang/1436087