KVM高可用性CS4.2暂时没有实现

The Linux Kernel Virtual Machine (KVM) is a very popular hypervisor choice amongst CloudStack and OpenStack users. It is free and comes ready with popular Linux distributions like CentOS/RedHat and Ubuntu. In some cases, customers insist on using and end to end Open Source solution for their private cloud and KVM ends up being the only choice available.

So in a recent deployment experience, where the private cloud had to run on full open source mature solutions, the obvious choice was to use Apache CloudStack 4.1 and KVM (on CentOS 6.x) Hypervisors. Building the management tier and the KVM hosts itself was a breeze with CentOS KickStart and the SSH based Ansible for post-install configuration of services.

The infrastructure too was built for resilience – dual power supplies, dual 4 port network controllers wired across east-west switches with LACP, HA for storage. From the CloudStack side, the management servers behind load balancers with MySQL replication services, multiple PODs, multiple Clusters and multiple Hosts in a cluster. Also, new service offerings created with HA enabled.

One of the resilience tests was to simply power off a random KVM hypervisor within a logical cluster and watch the affected HA enabled VM(s) auto start on another host within the same cluster after the time out period. To everyones surprise, the Guest VMs just sat there marked in ‘Up’ state despite physically being offline. A close look at the management logs show little to no activity that CloudStack even cared for these affected guest VMs and the KVM host.

CloudStack VM HA with KVM was simply not working.

After spending some time on the Apache CloudStack mailing lists and JIRA, it turns out that its a CloudStack feature to “do nothing” in a host down scenario. This is primarily to avoid any split brain situations where we could potentially end up with the multiple copies of the guest VMs running on more than one physical host due to network connectivity problems. Since KVM does not have in built clustering/HA features, it is up to the CloudStack layer to decide on a corrective course of action. At this time, CloudStack simply chooses to ignore failed KVM hosts.

The situation could be even more problematic if you unfortunately happen to have the CloudStack “virtual router” also running on the failed host. All basic network services like DHCP, DNS and routing for that POD will fail as the router would be offline. This actually happened to a someone on the mailing lists. The “fix” would be to go into the CloudStack database and mark the Virtual Router as “destroyed”. CloudStack would then create a newvirtual router and services would resume.

This issue is currently being discussed in this No HA actions are performed when a KVM host goes offline JIRA Ticket and there is developer interest in coming up with a solution for an upcoming Apache CloudStack 4.1.x release. Also see the thread HA not working – CloudStack 4.1.0 and KVM hypervisor hosts on cloudstack-users mailing list.

Please note that this problem is specific to KVM hypervisors only as they do not have in-built clustering capabilities. CloudStack with VMware and XenServers do not have this issue. Both VMware and XenServers clusters automatically do the right thing using their in-built clustering features.

As a side note, Citrix XenServer 6.2 has been fully open sourced in July and installation ISOs are available from XenServer.Org. Given the enterprise features that XenServer (like HA clustering and fault tolerance) already has over KVM, it is very likely to have massive adoption in fully open source clouds with future releases of Apache CloudStack.

Update: According to this thread, XCP is also affected.

 

资源引用:

http://shankerbalan.net/blog/cloudstack-kvm-ha/

bug号:

https://issues.apache.org/jira/browse/CLOUDSTACK-3535

CloudStack + KVM + HA的更多相关文章

  1. 用vmware workstation制作cloudstack(kvm)镜像及问题解决办法

    说明1:vmware workstation镜像是vmdk格式 说明2:cloudstack配置文件目录:/run/libvirt/qemu/     kvm配置文件目录:/etc/libvirt/q ...

  2. CloudStack+KVM环境搭建(步骤很详细,说明ClockStack是用来管理虚拟机的)

    文章目录环境准备配置本地域名解析关闭selinux安装ntp服务安装管理端安装Mysql数据库安装服务端RPM:初始化CloudStack数据库:初始化cloudstack管理服务器安装系统虚拟机安装 ...

  3. [转] 如何让CloudStack使用KVM创建Windows实例成功识别并挂载数据盘

    在使用kvm给windows虚拟机动态挂载virtio类型的硬盘时候遇到问题,通过下面的文章知道需要安装virtio驱动,从而解决问题使挂在正常,在此处mark一下 问题产生背景: 使用CloudSt ...

  4. CloudStack学习-3

    此次试验主要是CloudStack结合openvswitch 背景介绍 之所以引入openswitch,是因为如果按照之前的方式,一个网桥占用一个vlan,假如一个zone有20个vlan,那么岂不是 ...

  5. CloudStack 4.0.2 vRouter导致重启后状态不正常

    最近总玩CloudStack + KVM,发现在重启CloudStack服务后,host(kvm)的状态老是为alert.日志里出现如下错误提示: ERROR [agent.manager.Agent ...

  6. 我要为运维说一句,我们不是网管,好不!!Are you know?

    运维 运维,这里指互联网运维,通常属于技术部门,与研发.测试.系统管理同为互联网产品技术支撑的4大部门,这个划分在国内和国外以及大小公司间都会多少有一些不同. 一个互联网产品的生成一般经历的过程是:产 ...

  7. Install Open vSwitch on CentOS

    转载:http://cloud-mate.org/2015/06/installing-open-vswitch-centos-cloudstack/  June 5, 2015  Stuart Ne ...

  8. [转] 运维知识体系 -v3.1 作者:赵舜东(赵班长)转载请注明来自于-新运维社区:https://www.unixhot.com

    [From]https://www.unixhot.com/page/ops [运维知识体系]-v3.1 作者:赵舜东(赵班长) (转载请注明来自于-新运维社区:https://www.unixhot ...

  9. CLOUDSTACK HA功能,测试成功

    要注意VM HA和HOST HA两个级别的区别.并且要整合.

随机推荐

  1. Oracle 使用RMAN COPY 移动 整个数据库 位置 示例

    一.数据迁移说明 在DBA的工作中会遇到数据迁移的情况,比如将本地磁盘迁移到ASM,亦或者需要更换存储设备,那么我就需要迁移整个数据库的存储位置. 如果只是移动表空间或者数据文件,我们可以将表空间或者 ...

  2. gitlab安装、配置与阿里云产品集成

    https://www.ilanni.com/?p=12819 一.gitlab安装与部署 gitlab的安装可以分为源码安装和通过安装包进行安装,要是按照我以前的写作习惯的话,我也会把源码安装在本文 ...

  3. 2019Falg

    2019的Flag 2018 2018年对我来说是很重要的一年. 毕业--拿到硕士学位. 工作---成功转行进入互联网行业. 有了她. 上半年忙碌于毕业的各种事情,被毕业论文折磨的要疯,顺利走完所有流 ...

  4. linux CentOS 安装rz和sz命令 lrzsz

    lrzsz在linux里可代替ftp上传和下载. lrzsz 官网入口:http://freecode.com/projects/lrzsz/ lrzsz是一个unix通信套件提供的X,Y,和ZMod ...

  5. Oracle条件分支查询

    Oracle的条件分支查询其实跟java的条件分支语法没啥太大的区别,只不过java多了一个switch关键字而已.看例子: SQL ELSE SUM(t1.TOTALTICKET) END tota ...

  6. C语言中可变形参简单实例

    以下程序主要包括三个主要函数: 一个最简单的可变形参函数实例: 一个简单的printf功能的实例: 一个打印字符串函数(辅助): 其中myPrintf函数,实现了printf的部分简单功能,并没有去实 ...

  7. 算法提高 P1001【大数乘法】

    当两个比较大的整数相乘时,可能会出现数据溢出的情形.为避免溢出,可以采用字符串的方法来实现两个大数之间的乘法.具体来说,首先以字符串的形式输入两个整数,每个整数的长度不会超过8位,然后把它们相乘的结果 ...

  8. Windows远程桌面连接CentOS 7

    1. 安装tigervnc-server yum install tigervnc-server 2. 设置vncserver服务器 将默认提供的文件复制到/etc/systemd/system,命令 ...

  9. cloudera manager卸载流程

    注意:卸载Cloudera Manager后,根据需要保留或者删除集群中的Hadoop数据.下面的命令没有删除Hadoop数据,可以在控制台的Hadoop 和MapReduce /配置/选项卡,查看H ...

  10. [lua]判断nginx收到的是否json

    local post_data = ngx.req.get_body_data() --[[ngx.log(ngx.ERR, 'post data:', post_data)]] local ok, ...