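ResourceManager Offline Issue

Resolving a YARN outage in which the RM went offline

YARN

Problem

Back from the long holiday, I found that YARN had gone down. A first look showed that it happened at around 06:21:14 on October 6, 2017. The logs showed the following: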


  • yarn-hadoop-resourcemanager.out log
    Sep 01, 2017 5:51:27 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
    INFO: Binding org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices to GuiceManagedComponentProvider with the scope "Singleton"
    Halting due to Out Of Memory Error...
    Halting due to Out Of Memory Error...

  • yarn-hadoop-resourcemanager.log log
    2017-10-06 06:20:43,594 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1504259398577_4718_000001 State change from ALLOCATED to LAUNCHED
    2017-10-06 06:20:43,995 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1504259398577_4718_01_000001 Container Transitioned from ACQUIRED to RUNNING
    2017-10-06 06:20:56,990 WARN org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 10945ms for sessionid 0x15b4212b1f90dec
    2017-10-06 06:21:02,912 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 10945ms for sessionid 0x15b4212b1f90dec, closing socket connection and attempting reconnect
    2017-10-06 06:21:14,602 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
    2017-10-06 06:21:14,603 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session disconnected
    2017-10-06 06:21:15,668 ERROR org.mortbay.log: Error for /ws/v1/cluster/metrics
    java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
    at org.mortbay.io.nio.IndirectNIOBuffer.<init>(IndirectNIOBuffer.java:28)
    at org.mortbay.jetty.nio.AbstractNIOConnector.newBuffer(AbstractNIOConnector.java:69)
    at org.mortbay.jetty.AbstractBuffers.getBuffer(AbstractBuffers.java:75)
    at org.mortbay.jetty.HttpGenerator.completeHeader(HttpGenerator.java:281)
    at org.mortbay.jetty.HttpConnection.commitResponse(HttpConnection.java:632)
    at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1011)
    at com.sun.jersey.spi.container.servlet.WebComponent$Writer.flush(WebComponent.java:315)
    at java.util.zip.DeflaterOutputStream.flush(DeflaterOutputStream.java:282)
    at com.sun.jersey.spi.container.ContainerResponse$CommittingOutputStream.flush(ContainerResponse.java:145)
    at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
    at org.codehaus.jackson.impl.WriterBasedGenerator.flush(WriterBasedGenerator.java:906)
    at com.sun.jersey.json.impl.writer.JacksonStringMergingGenerator.flush(JacksonStringMergingGenerator.java:250)
    at com.sun.jersey.json.impl.writer.Stax2JacksonWriter.flush(Stax2JacksonWriter.java:360)
    at com.sun.xml.bind.v2.runtime.output.XMLStreamWriterOutput.endDocument(XMLStreamWriterOutput.java:112)
    at com.sun.xml.bind.v2.runtime.XMLSerializer.endDocument(XMLSerializer.java:856)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.postwrite(MarshallerImpl.java:374)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:321)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:177)
    at com.sun.jersey.json.impl.BaseJSONMarshaller.marshallToJSON(BaseJSONMarshaller.java:103)
    at com.sun.jersey.json.impl.provider.entity.JSONRootElementProvider.writeTo(JSONRootElementProvider.java:136)
    at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:157)
    at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
    2017-10-06 06:21:36,220 ERROR org.mortbay.log: Error for /ws/v1/cluster/metrics
    java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
    at org.mortbay.io.nio.IndirectNIOBuffer.<init>(IndirectNIOBuffer.java:28)
    at org.mortbay.jetty.nio.AbstractNIOConnector.newBuffer(AbstractNIOConnector.java:69)
    at org.mortbay.io.nio.IndirectNIOBuffer.<init>(IndirectNIOBuffer.java:28)
    at org.mortbay.jetty.nio.AbstractNIOConnector.newBuffer(AbstractNIOConnector.java:69)
    at org.mortbay.jetty.AbstractBuffers.getBuffer(AbstractBuffers.java:75)
    at org.mortbay.jetty.HttpGenerator.completeHeader(HttpGenerator.java:281)
    at org.mortbay.jetty.HttpConnection.commitResponse(HttpConnection.java:632)
    at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1011)
    at com.sun.jersey.spi.container.servlet.WebComponent$Writer.flush(WebComponent.java:315)
    at java.util.zip.DeflaterOutputStream.flush(DeflaterOutputStream.java:282)
    at com.sun.jersey.spi.container.ContainerResponse$CommittingOutputStream.flush(ContainerResponse.java:145)
    at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
    at org.codehaus.jackson.impl.WriterBasedGenerator.flush(WriterBasedGenerator.java:906)
    at com.sun.jersey.json.impl.writer.JacksonStringMergingGenerator.flush(JacksonStringMergingGenerator.java:250)
    at com.sun.jersey.json.impl.writer.Stax2JacksonWriter.flush(Stax2JacksonWriter.java:360)
    at com.sun.xml.bind.v2.runtime.output.XMLStreamWriterOutput.endDocument(XMLStreamWriterOutput.java:112)
    at com.sun.xml.bind.v2.runtime.XMLSerializer.endDocument(XMLSerializer.java:856)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.postwrite(MarshallerImpl.java:374)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:321)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:177)
    at com.sun.jersey.json.impl.BaseJSONMarshaller.marshallToJSON(BaseJSONMarshaller.java:103)
    at com.sun.jersey.json.impl.provider.entity.JSONRootElementProvider.writeTo(JSONRootElementProvider.java:136)
    at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:157)
    at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
    2017-10-06 06:21:23,419 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1504259398577_4718_000001 (auth:SIMPLE)
    2017-10-06 06:21:17,929 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server DSJHW02/10.200.0.12:2181. Will not attempt to authenticate using SASL (unknown error)
    2017-10-06 06:21:15,668 ERROR org.mortbay.log: Error for /ws/v1/cluster/info
    2017-10-06 06:21:15,668 ERROR org.mortbay.log: handle failed
    2017-10-06 06:23:48,458 ERROR org.mortbay.log: /ws/v1/cluster/metrics
    2017-10-06 06:23:42,576 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Socket Reader #1 for port 8032,5,main] threw an Error. Shutting down now...
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:23:40,398 ERROR org.mortbay.log: /ws/v1/cluster/info
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:23:30,739 ERROR org.mortbay.log: handle failed
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:23:26,642 ERROR org.mortbay.log: handle failed
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:23:05,876 ERROR org.mortbay.log: handle failed
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:23:02,304 ERROR org.mortbay.log: handle failed
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:22:56,733 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1504259398577_4718_000001
    2017-10-06 06:22:44,696 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to DSJHW02/10.200.0.12:2181, initiating session
    2017-10-06 06:24:09,330 ERROR org.mortbay.log: /ws/v1/cluster/metrics
    2017-10-06 06:23:57,109 ERROR org.mortbay.log: handle failed
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:23:53,689 ERROR org.mortbay.log: handle failed
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:23:53,689 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[156303910@qtp-1582028874-4794,5,main] threw an Error. Shutting down now...
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:24:35,613 ERROR org.mortbay.log: handle failed
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:24:33,523 WARN org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 114256ms for sessionid 0x15b4212b1f90dec
    2017-10-06 06:24:31,263 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1504259398577_4718_000001 State change from LAUNCHED to RUNNING
    2017-10-06 06:24:31,263 ERROR org.mortbay.log: handle failed
    java.lang.OutOfMemoryError: Java heap space
    2017-10-06 06:24:30,983 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop IP=10.200.0.15 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1504259398577_4718 APPATTEMPTID=appattempt_1504259398577_4718_000001
    2017-10-06 06:24:48,273 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1504259398577_4718 State change from ACCEPTED to RUNNING
    2017-10-06 06:24:46,468 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 114256ms for sessionid 0x15b4212b1f90dec, closing socket connection and attempting reconnect
    2017-10-06 06:24:50,793 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
    2017-10-06 06:24:50,793 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException

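The log above shows that ZooKeeper session 0x15b4212b1f90dec hit its 10-second timeout and the connection to the ZooKeeper server was lost. The RM recovery service (ZKRMStateStore) tried to restore the session but failed. After that, java.lang.OutOfMemoryError: Java heap space errors followed and YARN went down.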


  • zookeeper.out log
    2017-10-06 04:33:54,882 [myid:2] - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@489] - Processed session termination for sessionid: 0x15e4fe31e0304fe
    2017-10-06 04:34:09,519 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@489] - Processed session termination for sessionid: 0x15e4fe31e0304ff
    2017-10-06 04:34:23,880 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@489] - Processed session termination for sessionid: 0x15e4fe31e030500
    2017-10-06 05:20:34,599 [myid:2] - INFO [SyncThread:2:FileTxnLog@199] - Creating new log file: log.2900519a0e
    2017-10-06 05:20:34,599 [myid:2] - INFO [Snapshot Thread:FileTxnSnapLog@240] - Snapshotting: 0x2900519a0c to /u01/app/zookeeper/data/version-2/snapshot.2900519a0c
    2017-10-06 06:21:02,000 [myid:2] - INFO [SessionTracker:ZooKeeperServer@355] - Expiring session 0x15b4212b1f90dec, timeout of 10000ms exceeded
    2017-10-06 06:21:02,001 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@489] - Processed session termination for sessionid: 0x15b4212b1f90dec
    2017-10-06 06:22:36,820 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /10.200.0.11:51656
    2017-10-06 06:24:33,523 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@893] - Client attempting to renew session 0x15b4212b1f90dec at /10.200.0.11:51656
    2017-10-06 06:24:33,523 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@638] - Invalid session 0x15b4212b1f90dec for client /10.200.0.11:51656, probably expired
    2017-10-06 06:24:33,524 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /10.200.0.11:51656 which had sessionid 0x15b4212b1f90dec
    2017-10-06 18:24:32,243 [myid:2] - INFO [Snapshot Thread:FileTxnSnapLog@240] - Snapshotting: 0x290052be9e to /u01/app/zookeeper/data/version-2/snapshot.290052be9e
    2017-10-06 18:24:32,708 [myid:2] - INFO [SyncThread:2:FileTxnLog@199] - Creating new log file: log.290052bea0
    2017-10-07 08:13:47,700 [myid:2] - INFO [Snapshot Thread:FileTxnSnapLog@240] - Snapshotting: 0x290053f411 to /u01/app/zookeeper/data/version-2/snapshot.290053f411
    2017-10-07 08:13:48,599 [myid:2] - INFO [SyncThread:2:FileTxnLog@199] - Creating new log file: log.290053f413
    2017-10-07 23:48:30,849 [myid:2] - INFO [Snapshot Thread:FileTxnSnapLog@240] - Snapshotting: 0x29005550c7 to /u01/app/zookeeper/data/version-2/snapshot.29005550c7
    2017-10-07 23:48:31,535 [myid:2] - INFO [SyncThread:2:FileTxnLog@199] - Creating new log file: log.29005550c9

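The ZooKeeper log shows that the SessionTracker expired session 0x15b4212b1f90dec on its own at 2017-10-06 06:21:02 because the 10-second timeout had been exceeded. When the client then tried to renew the session at 2017-10-06 06:24:33, the server treated the session as invalid and closed the connection.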
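The 10000 ms in the expiry message matches the default that Hadoop 2.x documents for the RM's ZooKeeper session timeout, yarn.resourcemanager.zk-timeout-ms. Purely as a sketch, and assuming a Hadoop 2.x release that uses this property name (the troubleshooting described here did not change it), the timeout could be raised in yarn-site.xml like this. The value 30000 is an illustrative assumption; ZooKeeper caps the negotiated timeout at the server's maxSessionTimeout, which defaults to 20 times tickTime.

```xml
<!-- Sketch for yarn-site.xml. 30000 ms is an illustrative value, not taken from the original post. -->
<property>
  <name>yarn.resourcemanager.zk-timeout-ms</name>
  <value>30000</value>
</property>
```

Since the RM log shows the process running out of heap at the same time, a longer timeout would probably only postpone the session expiry rather than prevent it.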


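References

A similar problem is described in this blog post: "7. All regionservers under hb went offline".
The fix there was to raise dfs.datanode.max.transfer.threads, the maximum number of data-transfer threads on a DataNode.
This parameter defaults to 4096 and is currently set to 8192 here; the tuning guide recommends a value above 40,000. It must be the same on every node in the cluster, otherwise the load becomes unbalanced.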


Solution

Checking yarn-site.xml:

<property>
<name>yarn.resourcemanager.zk-state-store.address</name>
<value>DSJHW01:2181,DSJHW02:2181,DSJHW03:2181</value>
</property>

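The ZooKeeper address is configured, but the yarn.resourcemanager.ha.enabled setting is missing.
From this it can be inferred that, even when RM HA has not been set up, the RM still tries to connect to ZooKeeper and fail over as long as yarn-site.xml contains a ZooKeeper setting. Adding the RM HA configuration should therefore resolve the problem.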
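For illustration, a minimal RM HA fragment for yarn-site.xml might look like the sketch below. It is not the configuration actually deployed on this cluster: the IDs rm1/rm2, the cluster ID yarn-cluster and the host names rm-host-1/rm-host-2 are hypothetical placeholders, and only the ZooKeeper quorum is taken from the existing file. Property names also vary between Hadoop releases (newer 2.x releases document yarn.resourcemanager.zk-address, while the existing file uses the older yarn.resourcemanager.zk-state-store.address), so check the yarn-default.xml of the version in use.

```xml
<!-- Sketch only: rm1/rm2, yarn-cluster and rm-host-* are hypothetical placeholders. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm-host-1</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm-host-2</value>
</property>
<!-- ZooKeeper quorum copied from the existing yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>DSJHW01:2181,DSJHW02:2181,DSJHW03:2181</value>
</property>
```

The same fragment would need to be present on both ResourceManager hosts, and the RMs restarted, before it takes effect.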

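I also noticed that dfs.datanode.max.transfer.threads (roughly the equivalent of file handles on Linux) was set too low and should be raised. Likewise, the Java heap space of the Hive JVM was too low and should be increased as appropriate.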
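As a sketch of the first change, the parameter lives in hdfs-site.xml. The value 16384 below is only an illustrative assumption, not a figure from this post or from the tuning guide; whichever value is chosen should be applied identically on every DataNode.

```xml
<!-- Sketch for hdfs-site.xml on every DataNode. 16384 is an illustrative value. -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>16384</value>
</property>
```

The DataNodes have to be restarted for the new limit to apply.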

