This bug will kill supervisors

Affects Version/s: 0.9.2-incubating, 0.9.3, 0.9.4

Fix Version/s: 0.10.0, 0.9.5

问题背景

最近发现刚搭起的Storm集群,没过多久,Supervisor 便悄然死去了一大半。查看死去Supervisor的log,发现java.io.FileNotFoundException: File '../stormconf.ser' does not exist异常。网上给出的答案大多是

将 { storm.local.dir } 目录下的文件清空,重启就好了。

但这是指标不治本,即时重启可以跑起来,可是为什么会出现这个问题,依然不知道。

然后才发现线STORM-130解决了这个问题。该问题的重现场景:

1) Run a storm cluster with atleast 2 supervisors with 4 slots each
2) Deploy a topology that uses 4 workers, topology will be distributed with each supervisor having two workers each
3) kill one of the supervisor lets say supervisor1 
4) wait till topology re-balances to occupy 4 workers on supervisor2
5) now bring up supervisor1, It goes through the cycle of cleaning up old topology code
6) nimbus re-balances topology which triggers supervisor.sync-process method
7) sync-process tries to launch a worker for the topology whose code data is delete when the supervisor started causing it throw up following exception

问题原因

上面场景分析提到的 sync-process是supervisor运行的一个函数。Supervisor会在后台运行这两个函数:

  • synchronize-supervisor: This is called whenever assignments in Zookeeper change and also every 10 seconds.

    • Downloads code from Nimbus for topologies assigned to this machine for which it doesn't have the code yet.
    • Writes into local filesystem what this node is supposed to be running. It writes a map from port -> LocalAssignment. LocalAssignment contains a topology id as well as the list of task ids for that worker.
  • sync-processes: Reads from the LFS what synchronize-supervisor wrote and compares that to what's actually running on the machine. It then starts/stops worker processes as necessary to synchronize.

从描述中可以看出,synchronized-supervisor 和 sync-process 两个函数是通过 LFS 进行同步。The key reason is "synchronize-supervisor" which responsible for download file and remove file thread and "sync-processes" which responsible for start worker process thread is Asynchronous.

in synchronize-supervisor read assigment information from zk, supervisor download necessary file from nimbus and write local state. In aother thread sync-processes funciton read local state to launch workor process, when the worker process has not start ,synchronize-supervisor function is called again topology's assignment information has changed (cased by rebalance,or worker time out etc) worker assignment to this supervisor has move to another supervisor, synchronize-supervisor remove the unnecessary file (jar file and ser file etc.) , after this, worker launched by " sync-processes" ,ser file was not exsit , this issue occur.

可能解决办法

  • 换一个storm
  • 调整参数
    • Change "synchronize-supervisor" thread loop time to a longger than 10(default time) sec, such as 30 sec。
    • supervisor.worker.timeout.secs: 30 -> 5

References:

  • https://issues.apache.org/jira/browse/STORM-130
  • http://storm.apache.org/documentation/Lifecycle-of-a-topology.html

[Storm] java.io.FileNotFoundException: File '../stormconf.ser' does not exist的更多相关文章

  1. Spark启动报错|java.io.FileNotFoundException: File does not exist: hdfs://hadoop101:9000/directory

    at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:) at org.a ...

  2. com.jcraft.jsch.JSchException: java.io.FileNotFoundException: file:\D:\development\ideaProjects\salary-card\target\salary-card-0.0.1-SNAPSHOT.jar!\BOOT-INF\classes!\keystore\login_id_rsa 资源未找到

    com.jcraft.jsch.JSchException: java.io.FileNotFoundException: file:\D:\development\ideaProjects\sala ...

  3. 关于spark入门报错 java.io.FileNotFoundException: File file:/home/dummy/spark_log/file1.txt does not exist

    不想看废话的可以直接拉到最底看总结 废话开始: master: master主机存在文件,却报 执行spark-shell语句:  ./spark-shell  --master spark://ma ...

  4. Sqoop 抽数报错: java.io.FileNotFoundException: File does not exist

    Sqoop 抽数报错: java.io.FileNotFoundException: File does not exist 一.错误详情 2019-10-17 20:04:49,080 INFO [ ...

  5. Diagnostics: File file:/private/tmp/spark-d4ebd819-e623-47c3-b008-2a4df8019758/__spark_libs__6824092999244734377.zip does not exist java.io.FileNotFoundException: File file:/private/tmp/spark-d4ebd819

    spark伪分布式模式 on-yarn出现一下错误 Diagnostics: File file:/private/tmp/spark-d4ebd819-e623-47c3-b008-2a4df801 ...

  6. QA:java.lang.RuntimeException:java.io.FileNotFoundException:Resource nexus-maven-repository-index.properties does not exist.

    QA:java.lang.RuntimeException:java.io.FileNotFoundException:Resource nexus-maven-repository-index.pr ...

  7. java.io.FileNotFoundException:file:\D:\code\xml-load\target\XX.jar!\XXX(文件名、目录名或卷标语法不正确。)

    1.当使用Spring Boot将应用打成jar时,需要读取resources目录下配置文件时,通常使用ClassLoader直接读取,通常建议使用这种方式,直接将xml文件读成流传入 // 加载xm ...

  8. java.io.FileNotFoundException:SESSIONS.ser (系统找不到指定的路径。)

    问题如下: java.io.FileNotFoundException: E:\apache-tomcat-8.0.37\work\Catalina\localhost\20161013Shoppin ...

  9. Caused by: java.io.FileNotFoundException: velocity.log (No such file or directory)

    Caused by: org.apache.velocity.exception.VelocityException: Error initializing log: Failed to initia ...

随机推荐

  1. ofo走出校园观察:市场定位导致产品错位?

    Ofo和摩拜单车虽然同样都是做单车共享,但实际上两者在最初的市场定位是有明显的差异的,因此提供的产品方案也存在巨大的差异. 市场定位不同,导致产品方案的巨大差异 摩拜单车一开始就定位于开放市场,充分的 ...

  2. Git分支学习简记

    简介 开始过了两遍Git的内容,第二天就已经忘记了分支(branch)的概念,开始还觉得不太用的到.然后又看了第二遍,才发现为什么大家说这个是Git里边极其重要的一个东西. 所谓branch,就类似于 ...

  3. 【Beta版本】冲刺-Day7

    队伍:606notconnected 会议时间:12月15日 目录 一.行与思 二.站立式会议图片 三.燃尽图 四.代码Check-in 一.行与思 张斯巍(433) 今日进展:修改界面,应用图标 明 ...

  4. linux学习基础6之sed用法详解

    1 sed 又称为流编辑器,它逐行将文本文件中的行读取到模式空间中间去,将符合编辑条件的行进行编辑后输出到显示器上来.默认sed不编辑原文件只处理模式空间中的内容. 2 sed用法 sed [opti ...

  5. BZOJ1915: [Usaco2010 Open]奶牛的跳格子游戏

    权限题,没有传送门. 这很显然是一道DP题,刚看完题目可能会比较懵逼.这道题如果不要求回去,那么就是一道很裸的DP题.但是本题要求回去而且回去的格子的前一个格必须是之前经过的. 先不考虑回去的路程,对 ...

  6. SaltStack与ZeroMQ(二)

    SaltStack与ZeroMQ SaltStack底层是基于ZeroMQ进行高效的网络通信. ZeroMQ简介 ØMQ (也拼写作ZeroMQ,0MQ或ZMQ)是一个为可伸缩的分布式或并发应用程序设 ...

  7. [转]图片中的字符分割提取(基于opencv)

    http://blog.csdn.net/anqing715/article/details/16883863 源图片 像这些图片的字符就比较好操作,每个字符都独立,不连在一起,所以轮廓检测最好了.所 ...

  8. Instant Python 中文缩减版

    前言 本文主要来自<Python基础教程(第2版)>([挪]Magnus Lie Hetland著,司维 曾军崴 谭颖华译 人民邮电出版社) 中的“附录A 简明版本”,对于其中的有问题之处 ...

  9. [转发]Linux的系统调用宏

    原来在linux/include/linux/syscalls.h 中定义了如下的宏: 复制代码#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1 ...

  10. 在JavaScript中,arguments是对象的一个特殊属性。

    arguments对象 function函数的内置参数的"数组"/"集合":同时arguments对象就像数组,但是它却不是数组. 常用属性: 1.length ...