Visible Ops
Link: http://www.wikisummaries.org/Visible_Ops
What is ITIL?
- ITIL = Information Technology Infrastructure Library
- A "drastically different approach to IT" (p79)
- A "maturity path for IT that is not based on technology" (p79)
- A "collection of best practices codified in seven books by the Office of Government Commerce in the U.K." (p85)
- A collection "without prioritization or any prescriptive structure" (p18)
- Used by the Visible Ops authors as a framework "to normalize terminology" and to categorize traits shared across the high-performing organizations they studied (p18-20)
Introduction
(p10-24)
What is Visible Ops?
- Highest ROI best practices divided into four prioritized and incremental Phases
- All ideas are mapped to ITIL terminology
- Intended to be an "on-ramp" to ITIL
Key premises of the Visible Ops rationale
- 80% of unplanned outages are due to ill-planned changes made by administrators ("operations staff") or developers
- 80% of Mean Time To Repair (MTTR) is spent determining what changed
- With the right processes in place, it is easier, better, and more predictable to rebuild infrastructure than to repair it
- Concentrating staff time on pre-production efforts is more efficient and less expensive due to the high cost of repairing defects while in production
- Without process controls, pieces of infrastructure often become like unique snowflakes or irreplaceable works of art ... only understood by the "rocket scientist" creator whose time is tied to maintaining them (p41)
- "You can not manage what you can not measure" (p59)
Phase One: Stabilize the Patient
(p25-40)
Goals
- Identify most critical IT systems generating the most unplanned work
- Stabilize infrastructure (prioritizing the most fragile components)
- Create a "culture of causality" where all changes are viewed as key risks that need to be managed by facts rather than by beliefs
- Reduce unplanned work to 25% or less (high performers achieve lower than 5%)
- Maximize change success rates (high performers hit 98%)
- Minimize Mean Time to Repair (MTTR)
- Ensure security specialists become part of the decision process
- Shift staff time from "perpetual firefighting to more proactive work that addresses the root causes of problems"
- Minimize the IT failures that cause stress and damage IT's reputation
- Increase the overall level of confidence in IT
- Collect data to validate the new processes and to show that earlier perceptions of nimbleness and speed did not account for the time spent troubleshooting and doing unplanned work
Recommended key steps (to be implemented on most fragile systems first)
- Reduce or eliminate change privileges to fragile infrastructure
- Why? Every time a change is made you risk breaking functionality
- Create scheduled maintenance windows where all changes are made
- Why? Scheduled changes are more visible, and are more likely to be planned and tested before going into production
- Automate daily scans to detect and report changes
- Why? To automatically verify and log that all scheduled changes were made ... and that no other changes were made
- Warning: Based on their collected data, the authors strongly recommend that even the most trusted administrators still work under automated detection
- Disclosure: One of the authors is the CTO at Tripwire, Inc., the manufacturer of the software recommended for these automated scans
- When troubleshooting incidents, first analyze the recent changes (approved and detected) to isolate likely causes before recommending additional changes
- Schedule a weekly Change Advisory Board (CAB) made up of representatives from operations, networking, security and the service desk
- Why? To ensure key stakeholders collectively inform and influence change decisions
- Create a Change Advisory Board - Emergency Committee (CAB/EC) who can assemble quickly to review emergency change requests
- Why? "Emergency changes are the most critical to scrutinize"
- Create a Change Request Tracking System to document and track requests for changes (RFCs) through authorization, verification, and implementation processes
- Why? To facilitate the change approval process and to generate reports with metrics
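The daily scan step above can be illustrated with a minimal sketch. It assumes a file-hash baseline is an acceptable stand-in for the commercial tooling the book recommends; the function names (`snapshot`, `detect_changes`, `unauthorized`) and the idea of an `approved` set of paths taken from change requests are hypothetical, not from the book.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root to the SHA-256 digest of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(root))
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def detect_changes(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two snapshots and report added, removed, and modified files."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(p for p in shared if baseline[p] != current[p]),
    }


def unauthorized(changes: dict[str, list[str]], approved: set[str]) -> list[str]:
    """Return changed paths that no approved change request covers (hypothetical policy)."""
    changed = changes["added"] + changes["removed"] + changes["modified"]
    return sorted(p for p in changed if p not in approved)
```

In use, yesterday's `snapshot` would be stored as the baseline, today's compared against it, and anything returned by `unauthorized` reviewed first when troubleshooting an incident.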
Phase Two: Catch & Release and Find Fragile Artifacts
(p41-46)
Goals
- Prioritize IT's most critical services
- Identify critical pieces of production infrastructure (hardware and software)
- Identify interdependencies between components of production infrastructure
- Foster organizational learning
- Identify the high-risk "fragile artifacts"
Recommended key steps
- Create a prioritized service catalog that documents the most critical services
- Create a Configuration Management Database (CMDB) that illustrates mappings between services and infrastructure, and shows the interdependencies between all configuration items (CI)
- Freeze all related configurations for an agreed upon change-free window
- Why? To ensure an accurate inter-related configuration inventory (see below)
- Inventory all equipment and software in the data center, recording the whos, whats, interdependencies and history for each item
- Why? To facilitate faster problem management and to inform change decisions
- Note: This inventory should be implemented by the most senior staff to ensure the most knowledgeable capturing of configuration details and histories
- Identify the "fragile artifacts" that have the worst historical change success rates and/or the least technical mastery by the supporting technicians, and prioritize them by the criticality of the services they provide
- Why? To create a prioritized list of servers to rebuild in Phase Three
- To the extent possible, place fragile artifacts under a permanent configuration freeze until they can be replaced by complete rebuilds in Phase Three
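One way to picture the fragile-artifact prioritization is the sketch below. It assumes each configuration item records its change history and a criticality score; the `ConfigItem` fields, the 1-to-5 criticality scale, and the 0.8 threshold are illustrative assumptions, and the "technical mastery" criterion is left out because it is hard to quantify.

```python
from dataclasses import dataclass


@dataclass
class ConfigItem:
    name: str
    changes_attempted: int
    changes_succeeded: int
    criticality: int  # hypothetical scale: 1 (low) to 5 (business critical)

    @property
    def change_success_rate(self) -> float:
        if self.changes_attempted == 0:
            return 1.0  # assumption: no change history means no observed failures
        return self.changes_succeeded / self.changes_attempted


def rank_fragile(items: list[ConfigItem], threshold: float = 0.8) -> list[ConfigItem]:
    """Keep items whose success rate is below the threshold, ordered by
    criticality (highest first), then by success rate (worst first)."""
    fragile = [ci for ci in items if ci.change_success_rate < threshold]
    return sorted(fragile, key=lambda ci: (-ci.criticality, ci.change_success_rate))
```

The resulting order is a candidate rebuild queue for Phase Three; a real CMDB would supply the inputs.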
Phase Three: Establish Repeatable Build Library
(p47-58)
Goals
- Remove processes that encourage heroics by rewarding vigilant firefighters
- Increase team-level technical mastery of production infrastructure
- Shift senior staff from firefighting to fire prevention
- Ensure that critical infrastructure can be easily rebuilt
- Enable a new troubleshooting process with a short, predictable Mean Time To Repair (MTTR)
- Ensure perfect configuration synchronization between pre-production and production servers
- Ensure all configurations and build processes are completely documented
Recommended key steps (to be implemented on most fragile systems first)
- Create and maintain a versioned, Definitive Software Library (DSL) for all acquired and custom developed software and patches
- Note: additions must be approved by the Change Advisory Board (CAB)
- Exception: at the time of initial creation, all currently used production software will be accepted into the DSL under a one year grace period
- Create a team of release management engineers from your most senior operations staff. Only more junior staff will be on the production operations team.
- Prevent developers and the release management engineers (previously the senior operations staff) from accessing production infrastructure
- Reason 1: Policy encourages recommended changes to be error free with bullet-proof installation and back-out processes in place
- Reason 2: Process verifies completeness and accuracy of documentation for installation and operations procedures
- Release management engineers create automated, consolidated, integrated, patched, tested, security scanned, layer-able build packages which will then be provisioned onto production infrastructure by the more junior, production operations staff
- Reason 1: Consolidates the number of unique configuration counts (and thus increases team mastery of those fewer configurations)
- Reason 2: Ensures fully integrated quality assurance tests and security verifications
- Updates and even non-emergency patches are then rolled into a new "golden build", which is then applied to production hardware as a new build
- Reason 1: Eliminates the risk of "patch and pray"
- Reason 2: Otherwise, over time, break/fix cycles tend to encourage configuration variance between production and pre-production servers ... and between similar servers that should be identical
- Reason 3: Applying new builds allows for highly accurate predictions of downtime, reduces chances of human error, and is typically faster than applying numerous individual patches and updates
- As a general rule, installed build packages will be preceded by erasing the production hard drive (or partition) ... the book calls this a "bare-metal build"
- Why? This process ensures that production servers do not contain any hidden dependencies, and guarantees that the "golden builds" accurately reflect production systems, enabling perfect synchronization with pre-production servers
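The configuration variance that golden builds are meant to prevent can be checked with a short sketch like the following. It assumes a build can be described as a manifest mapping package names to versions; the manifest format and the function names are hypothetical, not something the book specifies.

```python
def build_drift(golden: dict[str, str], installed: dict[str, str]) -> dict[str, dict]:
    """Compare a server's package versions against the golden build manifest."""
    shared = golden.keys() & installed.keys()
    return {
        # In the golden build but absent from the server.
        "missing": {p: v for p, v in golden.items() if p not in installed},
        # On the server but not part of the golden build.
        "unexpected": {p: v for p, v in installed.items() if p not in golden},
        # Present in both, with (golden version, installed version) pairs that differ.
        "version_mismatch": {
            p: (golden[p], installed[p]) for p in shared if golden[p] != installed[p]
        },
    }


def matches_golden(golden: dict[str, str], installed: dict[str, str]) -> bool:
    """True when the server shows no drift from the golden build."""
    return not any(build_drift(golden, installed).values())
```

A server rebuilt from bare metal should report no drift; any non-empty category suggests an out-of-band change.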
Phase Four: Enable Continuous Improvement
(p59-64)
Goals
- Continuous increase in technical mastery of production infrastructure by reducing configuration variance
- Continuous improvement of change success rates
- Continuous increases in effective rate of change
- Continuous monitoring to avoid slips in performance
Recommended key steps
- Use recommended metrics to hone efforts from the first three Phases. A few selected examples:
- Percent of systems that match known good builds (higher is better)
- Time to provision known good builds (lower is better)
- Percent of builds that have security sign off (higher is better)
- Number of authorized changes per week (higher is better)
- Change success rate (higher is better)
- Strive to implement additional recommended improvement points. A few selected examples:
- Segregate the development, test, and production systems to safeguard against any possible unintentional crossovers or hidden dependencies
- Enforce a standard build across all similar devices
- Define bullet-proof back out processes to recover from failed or unauthorized changes
- Internalize the fundamental relationship between Mean Time to Repair (MTTR) and availability. By improving MTTR you also improve overall availability.
- Track repeat offenders who circumvent change management policies.
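Two of the metrics above lend themselves to a small worked sketch. The availability formula used here, MTBF / (MTBF + MTTR), is the standard steady-state definition rather than a quotation from the book, and the function names are illustrative.

```python
def change_success_rate(successful: int, total: int) -> float:
    """Fraction of changes that succeeded without causing an incident or rollback."""
    if total == 0:
        raise ValueError("no changes recorded")
    return successful / total


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from Mean Time Between Failures and Mean Time To Repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

With the time between failures held constant, halving MTTR raises availability, which is the relationship the last improvement points rely on.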