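Container status monitoring

This mainly monitors pod status, including restarts and unhealthy pods. The Kubernetes API already reports these states, so a script reads them and Zabbix raises the alerts.

Import the Zabbix template and link it to the oc master host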

<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
<version>3.2</version>
<date>--27T07::05Z</date>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<templates>
<template>
<template>OC Pods</template>
<name>OC Pods</name>
<description/>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<applications>
<application>
<name>restartCount</name>
</application>
<application>
<name>RunningStatus</name>
</application>
</applications>
<items/>
<discovery_rules>
<discovery_rule>
<name>OC Pods Discover</name>
<type></type>
<snmp_community/>
<snmp_oid/>
<key>oc.pod.status[discover,discover]</key>
<delay></delay>
<status></status>
<allowed_hosts/>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<delay_flex/>
<params/>
<ipmi_sensor/>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<filter>
<evaltype></evaltype>
<formula/>
<conditions/>
</filter>
<lifetime></lifetime>
<description/>
<item_prototypes>
<item_prototype>
<name>Pod {#POD_NAME} Get Status</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.pod.status[{#POD_NAME},get_status]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units/>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>RunningStatus</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Pod {#POD_NAME} Restarts</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.pod.status[{#POD_NAME},restarts]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units/>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>restartCount</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Pod {#POD_NAME} Running</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.pod.status[{#POD_NAME},running]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units/>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>RunningStatus</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
</item_prototypes>
<trigger_prototypes>
<trigger_prototype>
<expression>{OC Pods:oc.pod.status[{#POD_NAME},running].str(Running_true)}=&#;
and&#;
{OC Pods:oc.pod.status[{#POD_NAME},running].str(Pod deleted)}=</expression>
<recovery_mode></recovery_mode>
<recovery_expression/>
<name>Pod {#POD_NAME} Not Running</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{OC Pods:oc.pod.status[{#POD_NAME},restarts].str(Warning)}=</expression>
<recovery_mode></recovery_mode>
<recovery_expression>{OC Pods:oc.pod.status[{#POD_NAME},restarts].str(Warning,#)}=</recovery_expression>
<name>Pod {#POD_NAME} restarted Warning</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
</trigger_prototypes>
<graph_prototypes/>
<host_prototypes/>
</discovery_rule>
</discovery_rules>
<httptests/>
<macros/>
<templates/>
<screens/>
</template>
</templates>
</zabbix_export>

Zabbix agent configuration

Edit zabbix_agentd.conf

Timeout=
UserParameter=oc.pod.status[*],/data/app/zabbix/etc/oc_pod_monitor.sh $1 $2
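Each key parameter reaches the script as a positional argument. For the pod items, the first parameter is the {#POD_NAME} macro, which discovery fills with a "namespace=pod" string. The sketch below is only an illustration of how such a string can be split with shell parameter expansion; the function name `split_pod_key` and the sample pod name are hypothetical and are not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper: split a "namespace=pod" key into its two parts.
# Kubernetes names cannot contain "=", so the first "=" is the separator.
split_pod_key() {
  local key="$1"
  local namespace="${key%%=*}"   # text before the first "="
  local pod="${key#*=}"          # text after the first "="
  echo "$namespace $pod"
}

# Made-up sample value
split_pod_key "myproject=web-1-abcde"
```

A key without "=" (such as the literal "discover" used by the discovery rule) comes back unchanged in both fields, so the caller still has to treat that case separately.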

Contents of oc_pod_monitor.sh

#!/bin/bash
TOKEN=""
ENDPOINT=""
# The first argument has the form "namespace=pod"; keep only the pod name
POD_NAME="`echo "$1" |sed 's/.*=\(.*$\)/\1/'`"
Monitoring_type="$2"
WORKSPACE="/data/tmp/oc_monitor"
mkdir -p $WORKSPACE
# Look up the pod's namespace from its name
NAMESPACE="`jq -r '.items |.[] |.metadata |.name,.namespace' $WORKSPACE/all_pods.json |grep -A1 $POD_NAME |grep -v $POD_NAME`"
# Check whether the pod still exists
if [ "$POD_NAME" == "discover" ]; then
echo
elif [ ! -n "$NAMESPACE" ]; then
echo "Pod deleted"
exit
fi
# Low-level discovery
case $Monitoring_type in
discover)
# Fetch all pods and keep only the pod names
curl -k \
-H "Authorization: Bearer $TOKEN" \
-H 'Accept: application/json' \
https://$ENDPOINT/api/v1/pods 2>/dev/null > $WORKSPACE/all_pods.json
Pod_Name=(`jq -r '.items | .[] | .metadata | .name' $WORKSPACE/all_pods.json |egrep -v 'build|deploy|debug'`)
# Emit the result as JSON
printf "{\n"
printf '\t"data":[\n'
for ((i=0;i<${#Pod_Name[@]};i++))
do
NAMESPACE="`jq -r '.items |.[] |.metadata |.name,.namespace' $WORKSPACE/all_pods.json |grep -A1 ${Pod_Name[i]} |grep -v ${Pod_Name[i]}`"
Pod_Name_N="${NAMESPACE}=${Pod_Name[i]}"
printf '\t\t{\n'
num=$((${#Pod_Name[@]}-1))
if [ "$i" == ${num} ];
then
printf "\t\t\t\"{#POD_NAME}\":\"${Pod_Name_N}\"}\n"
else
printf "\t\t\t\"{#POD_NAME}\":\"${Pod_Name_N}\"},\n"
fi
done
printf "\t]\n"
printf "}\n"
exit
;;
get_status)
# Fetch the pod status once so that every other item can reuse it
curl -k \
-H "Authorization: Bearer $TOKEN" \
-H 'Accept: application/json' \
https://${ENDPOINT}/api/v1/namespaces/$NAMESPACE/pods/$POD_NAME/status 2>/dev/null > $WORKSPACE/${NAMESPACE}-${POD_NAME}.status
Pod_NotFound="`cat $WORKSPACE/${NAMESPACE}-${POD_NAME}.status |grep '"code": 404'`"
if [ -n "$Pod_NotFound" ]; then
echo "Pod_Status=NotFound"
exit
else
echo "Success"
exit
fi
;;
esac
# Load the cached pod status
if [ -f "$WORKSPACE/${NAMESPACE}-${POD_NAME}.status" ];then
Pod_Status="`cat $WORKSPACE/${NAMESPACE}-${POD_NAME}.status`"
else
echo "" > $WORKSPACE/${NAMESPACE}-${POD_NAME}.status
Pod_Status="`cat $WORKSPACE/${NAMESPACE}-${POD_NAME}.status`"
fi
# Handle abnormal Pod_Status values
if [ ! -n "$Pod_Status" ]; then # Pod_Status is empty
echo "Running_true Pod_Status=Null"
exit
elif [ -n "`echo "$Pod_Status" |grep '"code": 404'`" ]; then # the pod is gone but all_pods.json has not been refreshed yet
echo "Pod_Status=NotFound"
exit
elif [ "`echo "$Pod_Status" |jq -r '.status |.phase'`" = "Pending" ]; then # the pod is still in the Pending phase
echo "Pending"
exit
fi
# Select which value to return
case $Monitoring_type in
restarts)
# Report whether the pod has restarted
# A pod without a saved restartCount file is treated as new
if [ ! -f "$WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount" ]; then
echo "Warning New Pod"
echo "" > $WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount
exit
fi
# Read the values saved by the previous run
A_line=`sed -n 1p $WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount`
B_line_null="`sed -n 2p $WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount`"
if [ ! -n "$B_line_null" ]; then # no second restartCount value: count it as 0
B_line="0"
else
B_line=`sed -n 2p $WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount`
fi
Last_state=`expr $A_line + $B_line`
# Save and read the current values
echo "$Pod_Status" |jq -r '.status |.containerStatuses |.[] |.restartCount' > $WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount
A_line=`sed -n 1p $WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount`
B_line_null="`sed -n 2p $WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount`"
if [ ! -n "$B_line_null" ]; then # no second restartCount value: count it as 0
B_line="0"
else
B_line=`sed -n 2p $WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount`
fi
Current_state=`expr $A_line + $B_line`
# Compare the current restartCount with the previous one
if [ "$Current_state" -gt "$Last_state" ]; then
Restart_status="Warning restart_count=$Current_state"
else
Restart_status="Normal restart_count=$Current_state"
fi
echo "$Restart_status"
;;
running)
# Return the pod phase and the container readiness as one string
running_status=`echo "$Pod_Status" |jq -r '.status |.phase'`
Container_status="`echo "$Pod_Status" |jq -r '.status |.containerStatuses |.[] |.ready' |grep false`"
if [ ! -n "$Container_status" ]; then
Container_status="_true"
else
Container_status="_false"
fi
echo "${running_status}${Container_status}"
;;
*)
echo "Error parameters"
exit
;;
esac
exit
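
The discover branch of the script builds the low-level discovery JSON with a counter so that the last element gets no trailing comma. As a rough alternative sketch (not the author's code), the same document can be produced with a separator variable. The function name `lld_json` and the sample values are hypothetical, and the values are assumed to need no JSON escaping, which holds for pod and namespace names.

```bash
#!/bin/bash
# Hypothetical helper: print {"data":[{"{#MACRO}":"value"},...]} for a
# macro name followed by any number of values.
lld_json() {
  local macro="$1"; shift
  local sep="" value
  printf '{"data":['
  for value in "$@"; do
    # The separator is empty for the first element and "," afterwards
    printf '%s{"{#%s}":"%s"}' "$sep" "$macro" "$value"
    sep=","
  done
  printf ']}\n'
}

# Made-up sample values
lld_json POD_NAME "myproject=web-1-abcde" "myproject=db-1-xyz12"
```

With no values the function prints an empty "data" array, which avoids emitting malformed JSON when the API call returns nothing.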

With this in place, an alert is raised whenever a pod restarts or a new pod is created.
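
The restart check in the script reads at most two restartCount lines, one per container. The sketch below shows the same comparison for any number of containers. It is an illustration under that assumption rather than a drop-in replacement, and the function names `sum_restarts` and `restart_verdict` are hypothetical.

```bash
#!/bin/bash
# Sum restartCount values read from stdin, one per line (one line per
# container). Lines that are not plain non-negative integers are ignored.
sum_restarts() {
  local total=0 n
  while read -r n; do
    case "$n" in
      ''|*[!0-9]*) ;;                 # skip empty or non-numeric lines
      *) total=$((total + n)) ;;
    esac
  done
  echo "$total"
}

# Compare the current total with the total saved by the previous run and
# print the same strings the monitoring script returns.
restart_verdict() {
  local last="$1" current="$2"
  if [ "$current" -gt "$last" ]; then
    echo "Warning restart_count=$current"
  else
    echo "Normal restart_count=$current"
  fi
}

# Made-up sample: three containers now total 3 restarts, previous total 2
current=$(printf '2\n1\n0\n' | sum_restarts)
restart_verdict 2 "$current"
```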

Cluster node monitoring

This mainly monitors unhealthy node conditions, plus LVM volume capacity.

Import the Zabbix template and link it to the oc master host

<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
<version>3.2</version>
<date>--27T07::32Z</date>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<templates>
<template>
<template>OC Node Status</template>
<name>OC Node Status</name>
<description/>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<applications>
<application>
<name>oc_node</name>
</application>
</applications>
<items/>
<discovery_rules>
<discovery_rule>
<name>OC Nodes Discover</name>
<type></type>
<snmp_community/>
<snmp_oid/>
<key>oc.node.status[discover,discover]</key>
<delay></delay>
<status></status>
<allowed_hosts/>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<delay_flex/>
<params/>
<ipmi_sensor/>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<filter>
<evaltype></evaltype>
<formula/>
<conditions/>
</filter>
<lifetime></lifetime>
<description/>
<item_prototypes>
<item_prototype>
<name>Node {#NODE_NAME} DiskPressure</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.node.status[{#NODE_NAME},DiskPressure]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units/>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>oc_node</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Node {#NODE_NAME} Get Status</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.node.status[{#NODE_NAME},get_status]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units/>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications/>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Node {#NODE_NAME} MemoryPressure</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.node.status[{#NODE_NAME},MemoryPressure]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units/>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>oc_node</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Node {#NODE_NAME} Ready</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.node.status[{#NODE_NAME},node_ready]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units/>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>oc_node</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Node {#NODE_NAME} CPU Limits</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.node.status[{#NODE_NAME},node_resources,cpu_limits]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units>%</units>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>oc_node</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Node {#NODE_NAME} CPU Requests</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.node.status[{#NODE_NAME},node_resources,cpu_requests]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units>%</units>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>oc_node</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Node {#NODE_NAME} Memory Limits</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.node.status[{#NODE_NAME},node_resources,memory_limits]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units>%</units>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>oc_node</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Node {#NODE_NAME} Memory Requests</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.node.status[{#NODE_NAME},node_resources,memory_requests]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units>%</units>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>oc_node</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
<item_prototype>
<name>Node {#NODE_NAME} OutOfDisk</name>
<type></type>
<snmp_community/>
<multiplier></multiplier>
<snmp_oid/>
<key>oc.node.status[{#NODE_NAME},OutOfDisk]</key>
<delay></delay>
<history></history>
<trends></trends>
<status></status>
<value_type></value_type>
<allowed_hosts/>
<units/>
<delta></delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel></snmpv3_securitylevel>
<snmpv3_authprotocol></snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol></snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula></formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type></data_type>
<authtype></authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link></inventory_link>
<applications>
<application>
<name>oc_node</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<application_prototypes/>
</item_prototype>
</item_prototypes>
<trigger_prototypes>
<trigger_prototype>
<expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_resources,cpu_limits].last()}&gt;</expression>
<recovery_mode></recovery_mode>
<recovery_expression/>
<name>Node {#NODE_NAME} CPU Limits %</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_resources,cpu_requests].last()}&gt;</expression>
<recovery_mode></recovery_mode>
<recovery_expression/>
<name>Node {#NODE_NAME} CPU Requests %</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{OC Node Status:oc.node.status[{#NODE_NAME},DiskPressure].str(DiskPressure_False)}=</expression>
<recovery_mode></recovery_mode>
<recovery_expression/>
<name>Node {#NODE_NAME} DiskPressure</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_resources,memory_limits].last()}&gt;</expression>
<recovery_mode></recovery_mode>
<recovery_expression/>
<name>Node {#NODE_NAME} Memory Limits %</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{OC Node Status:oc.node.status[{#NODE_NAME},MemoryPressure].str(MemoryPressure_False)}=</expression>
<recovery_mode></recovery_mode>
<recovery_expression/>
<name>Node {#NODE_NAME} MemoryPressure</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_resources,memory_requests].last()}&gt;</expression>
<recovery_mode></recovery_mode>
<recovery_expression/>
<name>Node {#NODE_NAME} Memory Requests %</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_ready].str(Ready_True)}=</expression>
<recovery_mode></recovery_mode>
<recovery_expression/>
<name>Node {#NODE_NAME} Not Ready</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{OC Node Status:oc.node.status[{#NODE_NAME},OutOfDisk].str(OutOfDisk_False)}=</expression>
<recovery_mode></recovery_mode>
<recovery_expression/>
<name>Node {#NODE_NAME} OutOfDisk</name>
<correlation_mode></correlation_mode>
<correlation_tag/>
<url/>
<status></status>
<priority></priority>
<description/>
<type></type>
<manual_close></manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
</trigger_prototypes>
<graph_prototypes/>
<host_prototypes/>
</discovery_rule>
</discovery_rules>
<httptests/>
<macros/>
<templates/>
<screens/>
</template>
</templates>
</zabbix_export>

Zabbix agent configuration

Edit zabbix_agentd.conf

Timeout=
UserParameter=oc.node.status[*],/data/app/zabbix/etc/oc_node_monitor.sh $1 $2 $3

Contents of oc_node_monitor.sh

#!/bin/bash
TOKEN=""
ENDPOINT=""
NODE_NAME="$1"
Monitoring_type="$2"
WORKSPACE="/data/tmp/oc_monitor"
mkdir -p $WORKSPACE
case $Monitoring_type in
discover)
# Low-level discovery of nodes
Node_Name=(`curl -k \
-H "Authorization: Bearer $TOKEN" \
-H 'Accept: application/json' \
https://$ENDPOINT/api/v1/nodes 2>/dev/null |jq -r '.items|.[]|.metadata|.name'`)
printf "{\n"
printf '\t"data":[\n'
for ((i=0;i<${#Node_Name[@]};i++))
do
printf '\t\t{\n'
num=$((${#Node_Name[@]}-1))
if [ "$i" == ${num} ];
then
printf "\t\t\t\"{#NODE_NAME}\":\"${Node_Name[$i]}\"}\n"
else
printf "\t\t\t\"{#NODE_NAME}\":\"${Node_Name[$i]}\"},\n"
fi
done
printf "\t]\n"
printf "}\n"
exit
;;
get_status)
# Fetch the node status once so that every other item can reuse it
curl -k \
-H "Authorization: Bearer $TOKEN" \
-H 'Accept: application/json' \
https://${ENDPOINT}/api/v1/nodes/$NODE_NAME 2>/dev/null > $WORKSPACE/${NODE_NAME}.status
if [ -n "`cat $WORKSPACE/${NODE_NAME}.status |grep '"code": 404'`" ]; then
echo "Node_Status=NotFound"
exit
elif [ ! -n "`cat $WORKSPACE/${NODE_NAME}.status`" ]; then
echo "Node_Status=null"
exit
else
echo "Success"
exit
fi
;;
esac
case $Monitoring_type in
OutOfDisk)
# Report whether the node has run out of disk space
Node_Status="`cat $WORKSPACE/${NODE_NAME}.status |jq -r '.status|.conditions|.[]|.status' | sed -n 1p`"
if [ "$Node_Status" == "False" ]; then
echo "OutOfDisk_False"
elif [ ! -n "$Node_Status" ]; then
echo "OutOfDisk_False"
else
echo "OutOfDisk_$Node_Status"
fi
;;
MemoryPressure)
# Report whether the node is under memory pressure
Node_Status="`cat $WORKSPACE/${NODE_NAME}.status |jq -r '.status|.conditions|.[]|.status' | sed -n 2p`"
if [ "$Node_Status" == "False" ]; then
echo "MemoryPressure_False"
elif [ ! -n "$Node_Status" ]; then
echo "MemoryPressure_False"
else
echo "MemoryPressure_$Node_Status"
fi
;;
DiskPressure)
# Report whether the node is under disk pressure
Node_Status="`cat $WORKSPACE/${NODE_NAME}.status |jq -r '.status|.conditions|.[]|.status' | sed -n 3p`"
if [ "$Node_Status" == "False" ]; then
echo "DiskPressure_False"
elif [ ! -n "$Node_Status" ]; then
echo "DiskPressure_False"
else
echo "DiskPressure_$Node_Status"
fi
;;
node_ready)
# Report whether the node is Ready
Node_Status="`cat $WORKSPACE/${NODE_NAME}.status |jq -r '.status|.conditions|.[]|.status' | sed -n 4p`"
if [ "$Node_Status" == "True" ]; then
echo "Ready_True"
elif [ ! -n "$Node_Status" ]; then
echo "Ready_True"
else
echo "Ready_$Node_Status"
fi
;;
node_resources)
# Report how much of the node's resources has been allocated
null="`cat $WORKSPACE/${NODE_NAME}.resources |awk '{print $2}'`"
if [ ! -n "$null" ]; then
sleep
fi
if [ "$3" == "cpu_requests" ]; then
data="`cat $WORKSPACE/${NODE_NAME}.resources |awk '{print $2}' |grep -o '[0-9]*'`"
if [ "$data" -gt 0 ]; then
echo $data
else
echo 0
fi
elif [ "$3" == "cpu_limits" ]; then
data="`cat $WORKSPACE/${NODE_NAME}.resources |awk '{print $4}' |grep -o '[0-9]*'`"
if [ "$data" -gt 0 ]; then
echo $data
else
echo 0
fi
elif [ "$3" == "memory_requests" ]; then
data="`cat $WORKSPACE/${NODE_NAME}.resources |awk '{print $6}' |grep -o '[0-9]*'`"
if [ "$data" -gt 0 ]; then
echo $data
else
echo 0
fi
elif [ "$3" == "memory_limits" ]; then
data="`cat $WORKSPACE/${NODE_NAME}.resources |awk '{print $8}' |grep -o '[0-9]*'`"
if [ "$data" -gt 0 ]; then
echo $data
else
echo 0
fi
fi
;;
esac
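
The condition checks in the script pick each status by its position in the conditions array (first, second, third, fourth), which only works while the API keeps returning the conditions in that order. A more order-independent variant, sketched here as an assumption rather than as the author's method, selects by condition type instead. The function name `condition_status` is hypothetical. It expects "Type Status" lines on stdin, which jq can produce from the saved node status with the filter shown in the comment.

```bash
#!/bin/bash
# Input: one "Type Status" pair per line, e.g. produced by
#   jq -r '.status.conditions[] | "\(.type) \(.status)"' "$WORKSPACE/$NODE_NAME.status"
# Output: "<Type>_<Status>", or "<Type>_Unknown" when the type is absent.
condition_status() {
  local wanted="$1" type status
  while read -r type status; do
    if [ "$type" = "$wanted" ]; then
      echo "${wanted}_${status}"
      return 0
    fi
  done
  echo "${wanted}_Unknown"
}

# Made-up sample input
printf 'MemoryPressure False\nDiskPressure False\nReady True\n' | condition_status Ready
```

Unlike the script, which reports a missing value as healthy, this sketch reports it as Unknown. That is a design choice: with the trigger expressions in the template, an Unknown value would raise an alert instead of hiding a missing condition.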

crontab -e

*/ * * * * /data/scripts/oc_master_crontab.sh >/dev/null 2>&1

Contents of oc_master_crontab.sh

node_name=(`oc get node |grep -v "NAME" |awk '{print $1}'`)
for ((i=0;i<${#node_name[*]};i++))
do
oc describe node "${node_name[i]}" |grep -B "Events" |grep -v "Events" > /data/tmp/oc_monitor/${node_name[i]}.resources
chmod -R /data/tmp/
done
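
The node_resources branch of oc_node_monitor.sh reads the percentages from fields 2, 4, 6 and 8 of the saved .resources line. The sketch below extracts them by pattern instead of by field position. The sample line is an assumption about what `oc describe node` prints in its allocated-resources summary (inferred from the field positions the script uses), and the function name `nth_percent` is hypothetical.

```bash
#!/bin/bash
# Print the N-th "(NN%)" value found in a line, without the parentheses
# and the percent sign. Prints 0 when there is no such value.
nth_percent() {
  local line="$1" index="$2" value
  # grep -o prints each "(NN%)" match on its own line; sed picks the N-th
  value=$(printf '%s\n' "$line" | grep -o '([0-9]*%)' | tr -d '()%' | sed -n "${index}p")
  echo "${value:-0}"
}

# Assumed sample: CPU requests, CPU limits, memory requests, memory limits
sample='  1100m (27%)   2 (50%)   2Gi (13%)   4Gi (26%)'
nth_percent "$sample" 1
nth_percent "$sample" 4
```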
