Taking the PDSP node3 outage as an example, the procedure is as follows.

==========================START=====================================

The complete procedure for handling a server down or instance crash

For example:

An alert like the one below comes in:

Then we need to check the server status and send mail to the Linux team at Lst-Techops.DLRS@nike.com so they can help check or start the server (in most cases the server restarts automatically).

Also send mail to the APP team to inform them of the server status (once the server is up and the services are online, inform them again and ask them to check the APP status).

(Here is the application DL for reference)

Below are the database-related checks and troubleshooting steps:

Log in to ora-plus-p-3.va2.b2c.nike.com.

The connection fails, so the node has probably gone down.

Then log in to another cluster node, such as ora-plus-p-1.va2.b2c.nike.com.

Check the DB status:

oracle@ora-plus-p-1:PDSP1:/u01/home/oracle $ srvctl status database -d PDSP

Instance PDSP1 is running on node ora-plus-p-1

Instance PDSP2 is running on node ora-plus-p-2

Instance PDSP3 is not running on node ora-plus-p-3

Instance PDSP4 is running on node ora-plus-p-4
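
When a database has many instances, the same check can be scripted. The following is a minimal sketch in Python, assuming the srvctl output keeps the line format shown above; the function name down_instances is made up for this illustration and is not part of any Oracle tool.

```python
import re

# Matches lines such as "Instance PDSP3 is not running on node ora-plus-p-3".
STATUS_LINE = re.compile(
    r"^Instance (?P<instance>\S+) is (?P<neg>not )?running on node (?P<node>\S+)$"
)


def down_instances(srvctl_output):
    """Return (instance, node) pairs that srvctl reports as not running."""
    down = []
    for line in srvctl_output.splitlines():
        match = STATUS_LINE.match(line.strip())
        # The optional "not " group is only present for stopped instances.
        if match and match.group("neg"):
            down.append((match.group("instance"), match.group("node")))
    return down


# Output of "srvctl status database -d PDSP" copied from this incident.
SAMPLE = """\
Instance PDSP1 is running on node ora-plus-p-1
Instance PDSP2 is running on node ora-plus-p-2
Instance PDSP3 is not running on node ora-plus-p-3
Instance PDSP4 is running on node ora-plus-p-4
"""

print(down_instances(SAMPLE))  # [('PDSP3', 'ora-plus-p-3')]
```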

After some time, or with the help of the Linux team, the host comes back up.

Normally the CRS resources and DB resources start automatically when the server starts.

Commands to check the CRS resources or DB status:

sudo /u01/root/11.2.0.4/grid/bin/crsctl status res -t (run this with your own nikeid)

srvctl status database -d db_name

Make sure the necessary resources are online.

If CRS is not started:

sudo /u01/root/11.2.0.4/grid/bin/crsctl enable crs (so that CRS starts automatically when the node reboots)

sudo /u01/root/11.2.0.4/grid/bin/crsctl start crs

Then check whether any services need to be relocated.

For all Nike databases, the services are recorded on this shared drive:

\\NKE-WIN-NAS-P21nike.com\DCIT_DBA\Dataguard\

For the PDSP services, the directory is:

\\NKE-WIN-NAS-P21nike.com\DCIT_DBA\Dataguard\PDSP

First check which node each service is running on:

srvctl status service -d db_name

Then relocate the services to the right node.

In this case, we ran the following:

srvctl relocate service -d PDSP -s PDSPBATCH -i PDSP4 -t PDSP3

srvctl relocate service -d PDSP -s PDSPMISCL -i PDSP4 -t PDSP3

srvctl relocate service -d PDSP -s PDSPNODE3 -i PDSP2 -t PDSP3

srvctl relocate service -d PDSP -s PDSPSOCIAL -i PDSP4 -t PDSP3

srvctl relocate service -d PDSP -s SLOGICSVC -i PDSP4 -t PDSP3

Finally, check that each service is on its proper node.
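
With more than a handful of services, the relocate commands can be generated rather than typed by hand. The sketch below is in Python and assumes the current and preferred placements are already available as dictionaries; how they are read from srvctl or from the shared-drive record is not covered here, and the function name relocate_commands is made up for this illustration. It only prints the commands and does not run them.

```python
def relocate_commands(db_name, current, preferred):
    """Build one srvctl relocate command per service that is on the wrong instance.

    current and preferred both map service name -> instance name.
    """
    commands = []
    for service in sorted(preferred):
        running_on = current.get(service)
        target = preferred[service]
        # Skip services that are not running or are already in the right place.
        if running_on is None or running_on == target:
            continue
        commands.append(
            f"srvctl relocate service -d {db_name} -s {service} "
            f"-i {running_on} -t {target}"
        )
    return commands


# Placement observed in this incident after node3 came back.
CURRENT = {
    "PDSPBATCH": "PDSP4",
    "PDSPMISCL": "PDSP4",
    "PDSPNODE3": "PDSP2",
    "PDSPSOCIAL": "PDSP4",
    "SLOGICSVC": "PDSP4",
}
# All five services belong on PDSP3.
PREFERRED = {service: "PDSP3" for service in CURRENT}

for command in relocate_commands("PDSP", CURRENT, PREFERRED):
    print(command)
```

Review the printed commands against the service record before running any of them.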

Check the Streams status:

select apply_name, status from dba_apply;

select capture_name, status from dba_capture;

If they are not enabled, start them as shown below.

Log in as the strmadmin user:

exec DBMS_CAPTURE_ADM.START_CAPTURE(capture_name => 'capture_name');

exec DBMS_APPLY_ADM.START_APPLY(apply_name => 'apply_name');

Then check the capture and apply status again and make sure they are started and working.

Run the query below and make sure CAPTURE_TIME keeps changing, which means the capture process is working properly.

SELECT

c.CAPTURE_NAME,

to_char(CAPTURE_TIME, 'dd-mon-yy hh24:mi:ss') CAPTURE_TIME,

c.capture_message_number CAPTURE_MSG,

c.STATE,

c.TOTAL_MESSAGES_CAPTURED TOT_MESG_CAPTURE,

c.TOTAL_MESSAGES_ENQUEUED TOT_MESG_ENQUEUE,

SUBSTR(s.PROGRAM,INSTR(s.PROGRAM,'(')+1,4) PROCESS_NAME,

c.SID,

c.inst_id,

s.event

FROM GV$STREAMS_CAPTURE c, GV$SESSION s

WHERE c.SID = s.SID

and c.inst_id=s.inst_id

AND c.SERIAL# = s.SERIAL#  order by c.CAPTURE_NAME;

Run the query below and make sure APPLY_CREATE_TIME keeps changing, which means the apply process is working properly.

select ac.apply_name, ac.state,

to_char(applied_message_create_time, 'dd-mon-yyyy hh24:mi:ss') APPLY_CREATE_TIME,

round((sysdate-applied_message_create_time)*86400) "LATENCY_IN_SEC"

from dba_apply_progress ap,GV$STREAMS_APPLY_COORDINATOR ac

where ac.apply_name=ap.apply_name

order by ac.apply_name;
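
"Keeps changing" means comparing two runs of the same query taken some time apart. A minimal Python sketch of that comparison follows, assuming each run has been reduced to a dictionary of process name to the timestamp string returned by the query; the process names and timestamps in the example are hypothetical.

```python
def stalled_processes(first_sample, second_sample):
    """Return the names whose timestamp is identical in both samples.

    Each sample maps a capture or apply name to the CAPTURE_TIME or
    APPLY_CREATE_TIME string returned by the queries above. Names that
    are absent from the second sample are not reported here.
    """
    return sorted(
        name
        for name, timestamp in first_sample.items()
        if second_sample.get(name) == timestamp
    )


# Hypothetical process names and timestamps, taken a minute apart.
FIRST = {"CAPTURE_A": "18-oct-16 10:05:01", "CAPTURE_B": "18-oct-16 10:05:02"}
SECOND = {"CAPTURE_A": "18-oct-16 10:06:01", "CAPTURE_B": "18-oct-16 10:05:02"}

print(stalled_processes(FIRST, SECOND))  # ['CAPTURE_B']
```

A process that shows up as stalled once is not necessarily broken, since an idle source produces no new messages; it is worth sampling a few more times before acting.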

Check agent status:

oracle@ora-plus-p-3:PDSP3:/u01/home/oracle $ cd /u01/app/oracle/agent12c/agent_inst/bin/

oracle@ora-plus-p-3:PDSP3:/u01/app/oracle/agent12c/agent_inst/bin $ ./emctl status agent

Oracle Enterprise Manager Cloud Control 12c Release 4

Copyright (c) 1996, 2014 Oracle Corporation.  All rights reserved.

---------------------------------------------------------------

Agent is Not Running

oracle@ora-plus-p-3:PDSP3:/u01/app/oracle/agent12c/agent_inst/bin $ ./emctl start agent

Oracle Enterprise Manager Cloud Control 12c Release 4

Copyright (c) 1996, 2014 Oracle Corporation.  All rights reserved.

Starting agent .................. started.

oracle@ora-plus-p-3:PDSP3:/u01/app/oracle/agent12c/agent_inst/bin $ ./emctl status agent

Oracle Enterprise Manager Cloud Control 12c Release 4

…………………..

---------------------------------------------------------------

Agent is Running and Ready

Check whether this is a GoldenGate node. Unfortunately, node3 is a GoldenGate node, so we need to start the mgr/extract/pump processes that have abended.

ggsci

start mgr

start xxxx

If they start successfully and work properly (the RBA is moving), we are in luck.

But in some cases they cannot start, or they hang after starting. We can refer to the document below.

(This doc records solutions to many GoldenGate issues; I will share it in another blog post.)

In this case, after starting r1_cp, r2_cp and r3_cp, the RBA did not move and the send status command timed out,

so they were probably hung.

We tried killing and restarting them, but it did not help.

Then look at the database side:

SELECT s.sid,s.serial#,s.inst_id,s.sql_id,last_call_et "Run_in_sec",s.osuser "OS_user",s.machine,a.sql_text,

s.module,s.event,s.blocking_session

FROM     gv$session s,gv$sqlarea a

WHERE   s.sql_id = a.sql_id(+)  and    s.inst_id=a.inst_id(+)  and status='ACTIVE'  and username='GGADMIN'

and type='USER'    order by last_call_et desc;

From the SQL result, we can see that many locks are blocking the GoldenGate processes.

So the hung GoldenGate processes are caused by these blocking sessions.

After some time the locks still exist, so we need to send mail to the APP team to ask whether these sessions can be killed.

Like below:

With their permission, we kill these sessions and then restart the r1-r3 processes, after which GoldenGate works properly.

On the other side, we need to find out why the node rebooted.

We can usually find useful information on the surviving nodes.

In this case node3 rebooted, so I searched for information on node1 as below:

oracle@ora-plus-p-1:PDSP1:/u01/home/oracle $ cd /u01/app/11.2.0.4/grid/log/ora-plus-p-1/

oracle@ora-plus-p-1:PDSP1:/u01/app/11.2.0.4/grid/log/ora-plus-p-1 $ less alertora-plus-p-1.log

2016-10-18 09:46:01.809:

[cssd(2206)]CRS-1612:Network communication with node ora-plus-p-3 (3) missing for 50% of timeout interval.  Removal of this node from cluster in 14.610 seconds

2016-10-18 09:46:09.858:

[cssd(2206)]CRS-1611:Network communication with node ora-plus-p-3 (3) missing for 75% of timeout interval.  Removal of this node from cluster in 6.560 seconds

2016-10-18 09:46:13.860:

[cssd(2206)]CRS-1610:Network communication with node ora-plus-p-3 (3) missing for 90% of timeout interval.  Removal of this node from cluster in 2.560 seconds
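
These CRS-1612/1611/1610 messages can also be pulled out of a long alert log with a small script. The sketch below is in Python and assumes the two-line layout shown above, with a timestamp line followed by the message line; the function name heartbeat_warnings is made up for this illustration.

```python
import re

# Timestamp lines look like "2016-10-18 09:46:01.809:".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):?$")
# Network heartbeat warnings issued by cssd before a node is evicted.
HEARTBEAT = re.compile(
    r"CRS-(?P<code>161[012]):Network communication with node (?P<node>\S+) "
    r"\((?P<num>\d+)\) missing for (?P<pct>\d+)% of timeout interval\.\s+"
    r"Removal of this node from cluster in (?P<secs>[\d.]+) seconds"
)


def heartbeat_warnings(log_text):
    """Return (timestamp, node, percent_missed, seconds_left) tuples."""
    warnings = []
    last_timestamp = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        stamp = TIMESTAMP.match(line)
        if stamp:
            # Remember the timestamp; it belongs to the next message line.
            last_timestamp = stamp.group(1)
            continue
        hit = HEARTBEAT.search(line)
        if hit:
            warnings.append(
                (
                    last_timestamp,
                    hit.group("node"),
                    int(hit.group("pct")),
                    float(hit.group("secs")),
                )
            )
    return warnings


# Two of the entries from the node1 alert log shown above.
SAMPLE_LOG = """\
2016-10-18 09:46:01.809:
[cssd(2206)]CRS-1612:Network communication with node ora-plus-p-3 (3) missing for 50% of timeout interval.  Removal of this node from cluster in 14.610 seconds
2016-10-18 09:46:13.860:
[cssd(2206)]CRS-1610:Network communication with node ora-plus-p-3 (3) missing for 90% of timeout interval.  Removal of this node from cluster in 2.560 seconds
"""

for warning in heartbeat_warnings(SAMPLE_LOG):
    print(warning)
```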

We can also check node3's OSW network file to confirm whether there were any network errors:

cd /cust/app/oracle/OSW/oswbb/archive/oswnetstat/

cat ora-plus-p-3.va2.b2c.nike.com_netstat_16.11.28.1400.dat|grep -in "receive errors"

cat ora-plus-p-3.va2.b2c.nike.com_netstat_16.11.28.1400.dat|grep -in timeout

The above commands show many packet receive errors and timeout errors:

312009 packet receive errors

RcvbufErrors: 818

SndbufErrors: 6294

312009 packet receive errors

RcvbufErrors: 818

SndbufErrors: 6294
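
To compare these counters across several OSW snapshots, it helps to turn the matched lines into numbers first. A minimal Python sketch follows, assuming the two line shapes seen above (count first, or name followed by a colon and the count); the function name error_counters is made up for this illustration.

```python
import re

COUNT_FIRST = re.compile(r"^(\d+)\s+(.+)$")   # e.g. "312009 packet receive errors"
NAME_FIRST = re.compile(r"^(\w+):\s+(\d+)$")  # e.g. "RcvbufErrors: 818"


def error_counters(netstat_text):
    """Map each counter name found in the text to its integer value."""
    counters = {}
    for raw_line in netstat_text.splitlines():
        line = raw_line.strip()
        match = COUNT_FIRST.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
            continue
        match = NAME_FIRST.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


# The lines reported for node3 in this incident.
SAMPLE_NETSTAT = """\
312009 packet receive errors
RcvbufErrors: 818
SndbufErrors: 6294
"""

print(error_counters(SAMPLE_NETSTAT))
```

These counters are cumulative since boot, so what matters when comparing snapshots is how much they grow between them.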

So as the next step, we need to work with the Linux team and the network team on this.

The same method can be used to troubleshoot other node eviction cases.

======================ENDED==================================================
