Slow Server? This is the Flow Chart You're Looking For--reference
Your high-powered server is suddenly running dog slow, and you need to remember the troubleshooting steps again. Bookmark this page for a ready reminder the next time you need to diagnose a slow server.
Get on "top" of it
Linux's top command provides a wealth of troubleshooting information, but you have to know what you're looking for. Reference this diagram as you go through the steps below:
Step 1: Check I/O wait and CPU idle time
How: use top - look for "wa" (I/O wait) and "id" (CPU idle time)
Why: checking I/O wait is the best initial step to narrow down the root cause of server slowness. If I/O wait is low, you can rule out disk access in your diagnosis.
I/O wait represents the amount of time the CPU spends waiting for disk or network I/O. Waiting is the key here - if your CPU is waiting, it's not doing useful work. It's like a chef who can't serve a meal until he gets a delivery of ingredients. Anything above 10% I/O wait should be considered high.
On the other hand, CPU idle time is a metric you WANT to be high -- the higher this is, the more headroom your server has to handle whatever else you throw at it. If your idle time is consistently above 25%, consider it "high enough".
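As a rough illustration of this first check, the sketch below pulls the "wa" and "id" values out of top's CPU summary line and applies the 10% and 25% thresholds mentioned above. The sample line uses made-up numbers, and the function names are hypothetical; the line format is the one printed by common procps versions of top and may differ on your system.

```python
import re

# Thresholds taken from the text above: more than 10% I/O wait is "high",
# and idle time above 25% is "high enough".
IOWAIT_HIGH_PCT = 10.0
IDLE_OK_PCT = 25.0


def parse_cpu_summary(line):
    """Return a dict such as {'us': 5.9, 'id': 80.1, 'wa': 12.0} from top's CPU line."""
    # Handles both "12.0 wa" (newer procps) and "12.0%wa" (older procps).
    pairs = re.findall(r"(\d+(?:\.\d+)?)\s*%?\s*([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}


def classify(fields):
    """Label I/O wait and idle time using the thresholds above."""
    return {
        "iowait_high": fields["wa"] > IOWAIT_HIGH_PCT,
        "idle_high": fields["id"] > IDLE_OK_PCT,
    }


# Sample line with made-up numbers, in the format printed by recent versions of top.
sample = "%Cpu(s):  5.9 us,  2.0 sy,  0.0 ni, 80.1 id, 12.0 wa,  0.0 hi,  0.0 si,  0.0 st"
fields = parse_cpu_summary(sample)
print(classify(fields))
```

To feed it real data, one option is to capture a single batch-mode snapshot with top -b -n 1 and pass the line that starts with "%Cpu(s)" or "Cpu(s)".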
Step 2: IO Wait is low and idle time is low: check CPU user time
How: use top again -- look for the %us column (first column), then look for the process or processes doing the damage.
Why: at this point you expect the user time percentage to be high -- there's most likely a program or service you've configured on your server that's hogging CPU. Checking the % user time just confirms this. When you see that the % user time is high, it's time to see what executable is monopolizing the CPU.
Once you've confirmed that the % user time is high, check the process list (also provided by top). By default, top sorts the process list by %CPU, so you can just look at the top process or processes.
If there's a single process hogging the CPU in a way that seems abnormal, it's an anomalous situation that a service restart can fix. If there are multiple processes taking up CPU resources, or if there's one process that takes lots of resources while otherwise functioning normally, then your setup may just be underpowered. You'll need to upgrade your server (add more cores), or split services out onto other boxes. In either case, you have a resolution:
- if situation seems anomalous: kill the offending processes.
- if situation seems typical given history: upgrade server or add more servers.
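If you would rather capture the process list than read it off top's screen, a small sketch like the following ranks processes by CPU from the text output of ps -eo pid,pcpu,comm. The sample data is invented and the function name is hypothetical. Note that ps reports CPU usage averaged over each process's lifetime, so its numbers will not match top's instantaneous view exactly.

```python
def top_cpu_processes(ps_output, limit=3):
    """Return the `limit` busiest (pid, cpu_percent, command) tuples."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, pcpu, command = line.split(None, 2)
        rows.append((int(pid), float(pcpu), command))
    # Highest CPU percentage first.
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


# Made-up sample in the format of `ps -eo pid,pcpu,comm`.
sample = """
  PID %CPU COMMAND
    1  0.0 systemd
  812 93.5 ruby
  990  3.1 mysqld
 1204  0.4 sshd
"""
for pid, pcpu, command in top_cpu_processes(sample, limit=2):
    print(pid, pcpu, command)
```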
Step 3: IO wait is low and idle time is high
Your slowness isn't due to CPU or IO problems, so it's likely an application-specific issue. It's also possible that the slowness is being caused by another server in your cluster, or by an external service you rely on.
- start by checking important applications for uncharacteristic slowness (the DB is a good place to start),
- think through which parts of your infrastructure could be slowed down externally. For example, do you use an externally hosted email service that could slow down critical parts of your application?
If you suspect another server in your cluster, strace and lsof can provide information on what the process is doing or waiting on. strace will show you which file descriptors the process is reading from or writing to (or trying to), and lsof can give you a mapping of those file descriptors to network connections.
Step 4: IO Wait is high: check your swap usage
How: use top or free -m
Why: if your box is swapping out to disk a lot, the cache swaps will monopolize the disk and processes with legitimate IO needs will be starved for disk access. In other words, checking disk swap separates "real" IO wait problems from what are actually RAM problems that "look like" IO Wait problems.
An alternative to top is free -m. This is useful if you find top's frequent updates frustrating, since top redraws in place and leaves no console log of changes.
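For a scripted version of the swap check, the sketch below reads the Swap row from the output of free -m. The sample output is invented and the function name is hypothetical; the column order (total, used, free) is the one free normally prints. The text does not say how much swap usage counts as "high", so the percentage is reported without a verdict.

```python
def swap_usage(free_output):
    """Return (used_mb, total_mb, used_percent) from the Swap row of `free -m`."""
    for line in free_output.splitlines():
        if line.startswith("Swap:"):
            total, used = (int(x) for x in line.split()[1:3])
            # A box with no swap configured reports a total of 0.
            percent = 100.0 * used / total if total else 0.0
            return used, total, percent
    raise ValueError("no Swap row found")


# Made-up sample in the format of `free -m`.
sample = """\
              total        used        free      shared  buff/cache   available
Mem:           7976        3021         512         210        4442        4450
Swap:          2047        1535         512
"""
used, total, percent = swap_usage(sample)
print(f"swap: {used} of {total} MB used ({percent:.0f}%)")
```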
Step 5: swap usage is high
High swap usage means that you are actually out of RAM. See step 7 below.
Step 6: swap usage is low
Low swap means you have a "real" IO wait problem. The next step is to see what's hogging your IO.
How: iotop
iotop is an awesome tool for identifying io offenders. Two things to note:
- unless you've already installed iotop, it's probably not already on your system. Recommendation: install it before you need it -- it's no fun trying to install a troubleshooting tool on an overloaded machine.
- iotop requires a Linux kernel of 2.6.20 or above
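If iotop is not installed, the kernel's per-process counters in /proc/<pid>/io offer a rough fallback on kernels built with task I/O accounting. The sketch below is not how iotop itself works internally as far as this article states; it simply parses two snapshots of that file and reports how many bytes went to and from storage in between. The snapshot contents are made-up, the function names are hypothetical, and reading another user's /proc/<pid>/io normally requires root.

```python
def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.strip().splitlines():
        key, value = line.split(":")
        counters[key.strip()] = int(value)
    return counters


def disk_bytes_delta(before, after):
    """Bytes read from and written to storage between two snapshots."""
    return (
        after["read_bytes"] - before["read_bytes"],
        after["write_bytes"] - before["write_bytes"],
    )


# Two made-up snapshots of the same process, taken some seconds apart.
first = parse_proc_io("""
rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 1048576
write_bytes: 323932160
cancelled_write_bytes: 0
""")
second = parse_proc_io("""
rchar: 328130000
wchar: 324980000
syscr: 640000
syscw: 633000
read_bytes: 5242880
write_bytes: 324980736
cancelled_write_bytes: 0
""")
read_delta, write_delta = disk_bytes_delta(first, second)
print(f"read {read_delta} bytes, wrote {write_delta} bytes")
```

Repeating this for each process of interest and sorting by the deltas gives a crude picture of who is doing the most disk I/O.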
Step 7: Check memory usage
How: use top. Once top is running, press the M key - this will sort applications by the memory used.
Important: don't look at the "free" memory -- it's misleading. To get the memory actually in use, subtract the "cached" memory from the "used" memory; the rest of the total is effectively available. This is because Linux caches things liberally, and that memory can usually be freed up when it's needed. Read here (http://blog.scoutapp.com/articles/2010/10/06/determining-free-memory-on-linux) for more info.
Once you've identified the offenders, the resolution will again depend on whether their memory usage seems business-as-usual or not. For example, a memory leak can be satisfactorily addressed by a one-time or periodic restart of the process.
- if memory usage seems anomalous: kill the offending processes.
- if memory usage seems business-as-usual: add RAM to the server, or move memory-hungry services to other servers.
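The cache adjustment described in this step can be sketched against the contents of /proc/meminfo. The sample values are invented and the function names are hypothetical. Buffers are counted together with the page cache here, which is an assumption on top of the text. Newer kernels also expose a MemAvailable field that is usually a better estimate when it is present.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of values in kB."""
    values = {}
    for line in text.strip().splitlines():
        key, rest = line.split(":", 1)
        values[key] = int(rest.split()[0])
    return values


def memory_summary(info):
    """Return (used_excluding_cache_kb, effectively_free_kb)."""
    # Assumption: buffers are treated as reclaimable, like the page cache.
    cache = info["Buffers"] + info["Cached"]
    used = info["MemTotal"] - info["MemFree"] - cache
    return used, info["MemFree"] + cache


# Made-up excerpt in the format of /proc/meminfo.
sample = """
MemTotal:        8167848 kB
MemFree:          524288 kB
Buffers:          262144 kB
Cached:          4194304 kB
SwapTotal:       2096124 kB
SwapFree:        2096124 kB
"""
used_kb, free_kb = memory_summary(parse_meminfo(sample))
print(f"really used: {used_kb} kB, effectively free: {free_kb} kB")
```

In this sample the "free" figure alone suggests about 0.5 GB is left, while counting reclaimable cache shows closer to 4.7 GB.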
A handy flow chart to tie it all together
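The decision path laid out in steps 1 through 7 can also be written down as a small function. This is only a sketch of the steps as described in the text: the 10% and 25% thresholds come from step 1, while the function name and the swap_is_high flag are hypothetical, because the article does not give a number for "high" swap usage and leaves that judgment to you.

```python
def next_step(iowait_pct, idle_pct, swap_is_high=None):
    """Suggest which step to follow, given readings from top.

    swap_is_high is None until swap usage has been checked (step 4).
    """
    if iowait_pct > 10.0:
        if swap_is_high is None:
            return "Step 4: check swap usage with top or free -m"
        if swap_is_high:
            return "Step 7: out of RAM, check memory usage in top (press M)"
        return "Step 6: real I/O wait problem, find the offender with iotop"
    if idle_pct > 25.0:
        return "Step 3: look at applications and external services"
    return "Step 2: check CPU user time and the busiest processes"


print(next_step(iowait_pct=2.0, idle_pct=5.0))
print(next_step(iowait_pct=35.0, idle_pct=40.0))
print(next_step(iowait_pct=35.0, idle_pct=40.0, swap_is_high=True))
```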
Additional Tips
- vmstat is also a very handy tool, because it shows past values instead of an in-place update like top. Running
vmstat 1
shows concise metrics on memory, swap, io, and CPU every second.
- Track your disk IO latency and compare to IOPS (I/O operations per second). Sometimes it's not activity in your own server causing the disk IO to be slow in a cloud/virtual environment. Proving this is hard, and you really want to have graphs of historical performance to show your provider!
- Increasing IO latency can mean a failing disk or bad sectors. Keep an eye on this before it escalates to data corruption or complete failure of the disk.
- If you're a visual person, Scout's dashboards can help by graphing these metrics over time.
Wrapping it up
Having concrete steps at your fingertips makes slow server troubleshooting a little easier. Top is a powerful tool that provides a wealth of metrics to help you narrow down the cause of server slowness. The metrics you'll be looking at are io wait, cpu idle %, user %, memory free (taking into account the file cache), and swap usage. Depending on whether conditions are a one-off or the result of growing demands on your infrastructure, you may be able to solve the slowdown by restarting services, or you may need to upgrade your servers. Historical context via Scout or a similar tool can be very useful in establishing what's normal for your machines.
Original article: http://blog.scoutapp.com/articles/2014/07/31/the_slow_server_flow_chart