Background:

  1. IPv6 is disabled.

  2. /etc/hosts is configured correctly on every node, and jobs are submitted from the ResourceManager host.

  3. yarn-site.xml sets

    yarn.resourcemanager.hostname=Master
    yarn.nodemanager.aux-services=mapreduce_shuffle
    and each NodeManager sets its own yarn.nodemanager.hostname.

  4. mapred-site.xml sets mapreduce.framework.name=yarn.
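  To double-check that a node really carries the settings listed above, a small JDK-only reader for Hadoop-style *-site.xml content can help. This is only a sketch: the class name SiteXmlReader, the sample XML, and the worker hostname Slave1 are made up for illustration, and it assumes every <property> element has both a <name> and a <value> child.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads name/value pairs from Hadoop-style site XML. */
final class SiteXmlReader {

    /** Returns the properties in document order; later duplicates overwrite earlier ones. */
    static Map<String, String> readProperties(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            // Assumption: each property has exactly one <name> and one <value>.
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative content only; "Slave1" is a made-up worker hostname.
        String sample =
                "<configuration>"
              + "<property><name>yarn.resourcemanager.hostname</name><value>Master</value></property>"
              + "<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>"
              + "<property><name>yarn.nodemanager.hostname</name><value>Slave1</value></property>"
              + "</configuration>";
        readProperties(sample).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

  Passing the text of the real yarn-site.xml from each node instead of the sample string would show at a glance whether the hostname settings differ between nodes.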

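Symptoms:

  Submitting an MR job fails with a connection-refused stack trace. The container address it tries to connect to is localhost, which is not the address actually needed.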

User: root
Name: Bigdata-Hadoop-1.0-SNAPSHOT.jar
Application Type: MAPREDUCE
Application Tags:  
YarnApplicationState: FAILED
Queue: default
FinalStatus Reported by AM: FAILED
Started: Thu Nov 22 21:59:31 +0800 2018
Elapsed: 6mins, 1sec
Tracking URL: History
Diagnostics:
Application application_1542889591013_0006 failed 2 times due to Error launching appattempt_1542889591013_0006_000002. Got exception: java.net.ConnectException: Call From localhost/127.0.0.1 to localhost:33070 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy83.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy84.startContainers(Unknown Source)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:250)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:615)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:713)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
at org.apache.hadoop.ipc.Client.call(Client.java:1452)
... 15 more
. Failing the application.

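The two attempts listed at the bottom of the page also show localhost as the driver address.

Querying YARN for the cluster node list (/cluster/nodes) shows localhost as the address of every NodeManager.

All of this confirms that the NodeManager addresses reported by YARN are wrong. The ResourceManager therefore cannot make the remote call to a NodeManager to start a container, which directly causes the MR job to fail.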
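One quick check worth running on each worker is what the JVM itself thinks the local host is called. The sketch below uses only java.net.InetAddress. It assumes, without this post having confirmed it, that the hostname a NodeManager advertises follows what the JVM resolves for the local machine. The class and method names are made up for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

/** Hypothetical diagnostic: prints how this JVM resolves the local host. */
final class LocalHostCheck {

    /** True if the name or address would be useless to a caller on another machine. */
    static boolean looksLocalOnly(String hostName, InetAddress address) {
        String lower = hostName.toLowerCase(Locale.ROOT);
        return address.isLoopbackAddress()
                || lower.equals("localhost")
                || lower.startsWith("localhost.");
    }

    public static void main(String[] args) {
        try {
            InetAddress local = InetAddress.getLocalHost();
            String name = local.getHostName();
            System.out.println("hostname:  " + name);
            System.out.println("canonical: " + local.getCanonicalHostName());
            System.out.println("address:   " + local.getHostAddress());
            if (looksLocalOnly(name, local)) {
                System.out.println("WARNING: this node resolves itself to a loopback name or address");
            }
        } catch (UnknownHostException e) {
            System.out.println("could not resolve the local host: " + e.getMessage());
        }
    }
}
```

If a worker prints localhost or 127.0.0.1 here, name resolution on that node is the first thing to look at.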

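Approach:

  1. Searched blogs all over the web: nothing.

  2. Asked around in various chat groups: nothing.

  3. Fell back on my own effort and dug into the source code: to be continued ...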

Source code:

  There is no point reading the source without an entry point, so start here:

  org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java:252

    @GET
    @Path("/nodes")
    @Produces({ MediaType.APPLICATION_JSON, MediaType.APPLICATION_XML })
    public NodesInfo getNodes(@QueryParam("states") String states) {
        init();
        ResourceScheduler sched = this.rm.getResourceScheduler();
        if (sched == null) {
            throw new NotFoundException("Null ResourceScheduler instance");
        }

        EnumSet<NodeState> acceptedStates;
        if (states == null) {
            acceptedStates = EnumSet.allOf(NodeState.class);
        } else {
            acceptedStates = EnumSet.noneOf(NodeState.class);
            for (String stateStr : states.split(",")) {
                acceptedStates.add(
                    NodeState.valueOf(StringUtils.toUpperCase(stateStr)));
            }
        }

        Collection<RMNode> rmNodes = RMServerUtils.queryRMNodes(this.rm.getRMContext(),
            acceptedStates);
        NodesInfo nodesInfo = new NodesInfo();
        for (RMNode rmNode : rmNodes) {
            NodeInfo nodeInfo = new NodeInfo(rmNode, sched);
            if (EnumSet.of(NodeState.LOST, NodeState.DECOMMISSIONED, NodeState.REBOOTED)
                .contains(rmNode.getState())) {
                nodeInfo.setNodeHTTPAddress(EMPTY);
            }
            nodesInfo.add(nodeInfo);
        }
        return nodesInfo;
    }

  This is where the node information is generated.
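  The same endpoint can be called directly to see what the ResourceManager reports for each node. The sketch below uses only the JDK and assumes the ResourceManager web UI listens on Master:8088 (the default web port, not something this post states). The class and method names are made up. The states parameter corresponds to the @QueryParam("states") argument in the method above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical client for the ResourceManager's /ws/v1/cluster/nodes endpoint. */
final class ClusterNodesQuery {

    /** Builds the request URL; states is a comma-separated list or null for all states. */
    static String nodesUrl(String rmHost, int port, String states) {
        String base = "http://" + rmHost + ":" + port + "/ws/v1/cluster/nodes";
        return (states == null || states.isEmpty()) ? base : base + "?states=" + states;
    }

    /** Performs a GET and returns the response body as text. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Assumption: RM web address is Master:8088; adjust to your cluster.
        String url = nodesUrl("Master", 8088, "running");
        try {
            System.out.println(fetch(url));
        } catch (IOException e) {
            System.out.println("request to " + url + " failed: " + e);
        }
    }
}
```

  In the JSON that comes back, the nodeHostName and nodeHTTPAddress fields are the ones to compare against the expected worker hostnames.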

  org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeInfo.java:57

    public NodeInfo(RMNode ni, ResourceScheduler sched) {
        NodeId id = ni.getNodeID();
        SchedulerNodeReport report = sched.getNodeReport(id);
        this.numContainers = 0;
        this.usedMemoryMB = 0;
        this.availMemoryMB = 0;
        if (report != null) {
            this.numContainers = report.getNumContainers();
            this.usedMemoryMB = report.getUsedResource().getMemory();
            this.availMemoryMB = report.getAvailableResource().getMemory();
            this.usedVirtualCores = report.getUsedResource().getVirtualCores();
            this.availableVirtualCores = report.getAvailableResource().getVirtualCores();
        }
        this.id = id.toString();
        this.rack = ni.getRackName();
        this.nodeHostName = ni.getHostName();
        this.state = ni.getState();
        this.nodeHTTPAddress = ni.getHttpAddress();
        this.lastHealthUpdate = ni.getLastHealthReportTime();
        this.healthReport = String.valueOf(ni.getHealthReport());
        // ... (rest of the constructor is not shown in this excerpt)
    }

  All three key pieces of information come from this odd ni object, so let's see where it comes from.

  org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java:63

    public static List<RMNode> queryRMNodes(RMContext context,
            EnumSet<NodeState> acceptedStates) {
        // nodes contains nodes that are NEW, RUNNING OR UNHEALTHY
        ArrayList<RMNode> results = new ArrayList<RMNode>();
        if (acceptedStates.contains(NodeState.NEW) ||
            acceptedStates.contains(NodeState.RUNNING) ||
            acceptedStates.contains(NodeState.UNHEALTHY)) {
            for (RMNode rmNode : context.getRMNodes().values()) {
                if (acceptedStates.contains(rmNode.getState())) {
                    results.add(rmNode);
                }
            }
        }
        // ... (rest of the method is not shown in this excerpt)
    }

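  So this context holds something interesting. How the context is initialized is a topic for next time; for now, look at what is done to RMNodes inside it.

  The rest of the time was spent wrestling with YARN, and in the end I could not find out how the hostname became localhost instead of the expected worker node hostname. There is a lot of code and it is tangled, so it needs more time to sort out; the source reading continues next time. Still, once you understand the principles to some degree, a pass through the source really does help you understand them better.

  Reading the source did not produce the answer, but it led to a bold guess: resolving a hostname from an IP takes the first hostname that matches that IP in the hosts file (to be confirmed). So I moved the worker node's IP and hostname entry to the first line of the hosts file and restarted the YARN cluster, and the MR job ran through immediately.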
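  The guess can be written down as a tiny model. The sketch below is not how the operating system resolver or Hadoop actually performs the lookup (that is exactly the part still to be confirmed). It only simulates the rule "the first hostname on the first line matching the IP wins" on hosts-file text. The class name, the IP 192.168.1.11, and the hostname Slave1 are made up for illustration.

```java
import java.util.Optional;

/** Hypothetical model of the "first matching hosts entry wins" guess. */
final class HostsLookup {

    /**
     * Returns the first hostname on the first non-comment line of hostsText
     * whose address column equals ip, or empty if no line matches.
     */
    static Optional<String> firstHostnameFor(String hostsText, String ip) {
        for (String rawLine : hostsText.split("\\R")) {
            // Drop anything after '#' and trim surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (fields.length >= 2 && fields[0].equals(ip)) {
                return Optional.of(fields[1]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up hosts content: the same IP appears on two lines.
        String before = "192.168.1.11 localhost\n192.168.1.11 Slave1\n";
        String after = "192.168.1.11 Slave1\n192.168.1.11 localhost\n";
        System.out.println("before: " + firstHostnameFor(before, "192.168.1.11").orElse("(none)"));
        System.out.println("after:  " + firstHostnameFor(after, "192.168.1.11").orElse("(none)"));
    }
}
```

  Under this model, moving the worker's entry to the top changes the answer from localhost to Slave1, which matches what I observed after the restart. Whether the real resolver behaves this way still needs to be verified.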
