Recently we looked at some of the most common behaviors that our community of 25,000 users searches for in their logs, with a particular focus on web server logs. Our research identified the top 15 web server tags and alerts created by our customers. You can read more about these in our community insights section, and you can easily create tags or alerts based on the patterns to identify these behaviors in your own systems.

This week we are focusing on performance analysis using log data. Again we looked across our community of over 25,000 users and identified 5 ways in which people use log data to analyze system performance. As always, customer data was anonymized and privacy protected. Over the course of the next week we will dive into each of these areas in more detail and feature customers' first-hand accounts of how they use logs to identify and resolve such issues in their systems.

Our research looked at more than 200k patterns from across our community, created by users to identify important events in their log data. With a particular focus on performance-related issues, we identified the following 5 areas as trending and common across our user base:

1. Slow Response Times

Response times are one of the most common and useful performance measures available from your log data. They give you an immediate understanding of how long a request takes to be returned. For example, web server logs can show how long a request takes to return a response to a client device. This can include the time taken by the different components behind your web server (application servers, DBs) to process the request, so it gives an immediate view of how well your application is performing. Recording response times from the client device/browser can give you an even more complete picture, since it also captures page load time in the app/browser as well as network latency.

A good rule of thumb when measuring response times is to follow the 3 response time limits outlined by Jakob Nielsen in his 1993 publication ‘Usability Engineering’, which are still relevant today. In short, 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, 1.0 second is about the limit for the user’s flow of thought to stay uninterrupted, and 10 seconds is about the limit for keeping the user’s attention focused on the dialogue.

Slow response time patterns almost always take the form below:

  • response_time>X

Here response_time is the field holding the server’s or client’s response time, and ‘X’ is a threshold. When the threshold is exceeded, you want the event to be highlighted or a notification to be sent so that you and your team are aware that somebody is having a poor user experience.
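As a minimal sketch of the `response_time>X` pattern, the Python below scans log lines for a `response_time` field and reports the ones over a threshold, labelling each with the Nielsen limit it falls under. The `key=value` log format, the sample lines, and the function names `classify` and `slow_requests` are assumptions made for illustration, not part of any particular logging product.

```python
import re

# Hypothetical log format: key=value pairs, with response_time in seconds.
RESPONSE_TIME = re.compile(r"\bresponse_time=(\d+(?:\.\d+)?)")

def classify(seconds):
    """Map a response time onto the three Nielsen limits quoted above."""
    if seconds <= 0.1:
        return "instantaneous"
    if seconds <= 1.0:
        return "uninterrupted"
    if seconds <= 10.0:
        return "attention held"
    return "attention lost"

def slow_requests(lines, threshold):
    """Yield (seconds, line) for every line where response_time > threshold."""
    for line in lines:
        match = RESPONSE_TIME.search(line)
        if match and float(match.group(1)) > threshold:
            yield float(match.group(1)), line

# Made-up sample lines in the assumed format.
sample = [
    "GET /home status=200 response_time=0.08",
    "GET /search status=200 response_time=2.4",
    "POST /checkout status=200 response_time=12.7",
]

for seconds, line in slow_requests(sample, threshold=1.0):
    print(f"{classify(seconds)}: {line}")
```

With a threshold of 1.0 second, only the `/search` and `/checkout` lines are reported. In practice the threshold would be tuned to what counts as a poor experience for your own application.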

2. Memory Issues and Garbage Collection

Out-of-memory errors can be pretty catastrophic, as they often result in the application crashing due to lack of resources. You want to know about these when they occur, so creating tags and generating notifications via alerts for these events is always recommended.

However, a leading indicator of out-of-memory issues can be your garbage collection behavior. Tracking it and getting notified if heap used vs. free heap space goes over a particular threshold, or if garbage collection is taking a long time, can be particularly useful and can often point you in the direction of memory leaks. Identifying a memory leak before an out-of-memory exception occurs can be the difference between a major system outage and a simple server restart until the issue is patched.

Furthermore, slow or long garbage collection can also be one of the reasons for users experiencing slow application behavior: during garbage collection your system can slow down, or in some situations block entirely until garbage collection is complete (e.g. with ‘stop the world’ garbage collection).
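The two leading indicators described above, heap used vs. free and long garbage collection pauses, can be sketched as a simple check over a log line. The field names `heap_used_mb`, `heap_free_mb` and `pause_ms`, the default thresholds, and the function name `gc_warnings` are all hypothetical; real GC logs vary by runtime and would need their own parsing.

```python
import re

# Hypothetical GC log line: "gc pause_ms=850 heap_used_mb=1900 heap_free_mb=100"
FIELD = re.compile(r"(\w+)=(\d+(?:\.\d+)?)")

def gc_warnings(line, max_heap_fraction=0.9, max_pause_ms=500):
    """Return warnings for a GC log line; both thresholds are illustrative."""
    fields = {name: float(value) for name, value in FIELD.findall(line)}
    warnings = []

    used = fields.get("heap_used_mb")
    free = fields.get("heap_free_mb")
    if used is not None and free is not None and used + free > 0:
        fraction = used / (used + free)  # share of the heap currently in use
        if fraction > max_heap_fraction:
            warnings.append(f"heap {fraction:.0%} used")

    pause = fields.get("pause_ms")
    if pause is not None and pause > max_pause_ms:
        warnings.append(f"GC pause {pause:.0f} ms")

    return warnings

print(gc_warnings("gc pause_ms=850 heap_used_mb=1900 heap_free_mb=100"))
print(gc_warnings("gc pause_ms=40 heap_used_mb=512 heap_free_mb=1536"))
```

The first line trips both checks and the second trips neither. A heap fraction that stays high across successive collections is the kind of signal that may point to a leak.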

Below are some examples of common patterns used to identify some of the memory related issues outlined above:

  • Out of memory
  • exceeds memory limit
  • memory leak detected
  • java.lang.OutOfMemoryError
  • System.OutOfMemoryException
  • memwatch:leak: Ended heapDiff
  • GC AND stats
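To illustrate how patterns like these might be applied, here is a minimal sketch that treats each pattern as a case-insensitive substring and treats `AND` as requiring every term to appear in the line. These matching semantics are an assumption for illustration and may differ from those of your log management tool; `matches` and `matching_patterns` are hypothetical helper names.

```python
MEMORY_PATTERNS = [
    "Out of memory",
    "exceeds memory limit",
    "memory leak detected",
    "java.lang.OutOfMemoryError",
    "System.OutOfMemoryException",
    "memwatch:leak: Ended heapDiff",
    "GC AND stats",
]

def matches(pattern, line):
    """True if every AND-separated term occurs in the line (case-insensitive)."""
    lowered = line.lower()
    return all(term.strip().lower() in lowered for term in pattern.split(" AND "))

def matching_patterns(line):
    """Return the patterns from the list above that the line matches."""
    return [pattern for pattern in MEMORY_PATTERNS if matches(pattern, line)]

print(matching_patterns("Exception in thread main java.lang.OutOfMemoryError: Java heap space"))
print(matching_patterns("GC pause stats: young=12ms old=340ms"))
```

Plain substring matching is deliberately naive: a short term such as `GC` will also match inside longer words, so a production rule would likely match on whole tokens or fields instead.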

3. Deadlocks and Threading Issues

Deadlocks come in many shapes and sizes and can have pretty bad effects, anything from bringing your system to a complete halt to simply slowing it down. In short, a deadlock is a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does. For example, we say that a set of processes or threads is deadlocked when each one is waiting for an event that only another member of the set can cause.
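The definition can be made concrete with a small sketch: if you record which thread each blocked thread is waiting on, a deadlock shows up as a cycle in that "waits for" relationship. The `waits_for` mapping and the `find_deadlock` function are hypothetical, and the sketch assumes each thread waits on at most one other thread.

```python
def find_deadlock(waits_for):
    """Return the threads forming a wait cycle, or None if there is none.

    waits_for maps each blocked thread to the single thread it is waiting on.
    """
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a thread.
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            return seen[seen.index(current):]
    return None

# A waits on B and B waits on A: neither can ever proceed.
print(find_deadlock({"A": "B", "B": "A"}))
# A waits on B, B waits on C, and C is not blocked: no deadlock.
print(find_deadlock({"A": "B", "B": "C"}))
```

The first call returns the cycle `['A', 'B']` and the second returns `None`.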

Not surprisingly, deadlocks feature as one of the top 5 performance-related issues that our users write patterns to detect in their systems.

Most deadlock patterns simply contain the keyword ‘deadlock’. Some of the common ones are:

  • ‘deadlock’
  • ‘Deadlock found when trying to get lock’
  • ‘Unexpected error while processing request: deadlock;’

4. High Resource Usage (CPU/Disk/Network)

In many cases a slowdown in system performance is not the result of any major software flaw, but a simple case of the load on your system increasing without additional resources being available to deal with it. Tracking resource usage lets you see when you need additional capacity, so that you can, for example, kick off more server instances.

Example patterns used when analysing resource usage:

  • metric=/CPUUtilization/ AND minimum>X
  • cpu>X
  • disk>X
  • disk is at or near capacity
  • not enough space on the disk
  • java.io.IOException: No space left on device
  • insufficient bandwidth
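Threshold patterns such as `cpu>X` and `disk>X` can be sketched as a small rule check over `key=value` metrics. The log format, the threshold values standing in for `X`, and the function name `exceeded` are assumptions chosen for illustration.

```python
import re

# Hypothetical metrics line: "host=web-1 cpu=97.5 disk=40 net=12" (values in percent).
FIELD = re.compile(r"(\w+)=(\d+(?:\.\d+)?)")

# Illustrative thresholds standing in for the X in cpu>X and disk>X.
THRESHOLDS = {"cpu": 85.0, "disk": 90.0}

def exceeded(line, thresholds=THRESHOLDS):
    """Return {metric: value} for each metric in the line above its threshold."""
    values = {name: float(value) for name, value in FIELD.findall(line)}
    return {
        name: values[name]
        for name, limit in thresholds.items()
        if name in values and values[name] > limit
    }

print(exceeded("host=web-1 cpu=97.5 disk=40 net=12"))
print(exceeded("host=web-2 cpu=35 disk=93.2 net=8"))
```

The first line is flagged for CPU and the second for disk. A non-empty result is the point at which you might raise an alert or trigger additional capacity.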

5. Database Issues and Slow Queries

Knowing when a query failed is useful because it lets you identify situations where a request may have returned without the relevant data, and thus when users are not getting the data they need. A more subtle issue is when a user gets the correct results but they take a long time to return. While the system may technically be fine and bug free, a slow user experience may be hurting your top line.

Tracking slow queries shows you how your DB queries are performing. Setting acceptable thresholds for query time and reporting on anything that exceeds them can help you quickly identify when your users’ experience is being affected.

Example patterns:

  • SqlException
  • SQL Timeout
  • Long query
  • Slow query
  • WARNING: Query took longer than X
  • Query_time > X
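As a rough sketch of the `Query_time > X` pattern, the Python below extracts query times from lines shaped like the `# Query_time: ...` header that MySQL writes to its slow query log and keeps those above a threshold. The sample lines are made up, and the function name `slow_queries` and the threshold are assumptions for illustration.

```python
import re

# Matches the Query_time value (in seconds) in a slow query log header line.
QUERY_TIME = re.compile(r"Query_time:\s*(\d+(?:\.\d+)?)")

def slow_queries(lines, threshold_seconds):
    """Return the query times (in seconds) that exceed the threshold."""
    times = []
    for line in lines:
        match = QUERY_TIME.search(line)
        if match and float(match.group(1)) > threshold_seconds:
            times.append(float(match.group(1)))
    return times

# Made-up header lines in the assumed format.
sample = [
    "# Query_time: 0.004512  Lock_time: 0.000031 Rows_sent: 10  Rows_examined: 10",
    "# Query_time: 3.214567  Lock_time: 0.000045 Rows_sent: 1  Rows_examined: 982113",
]

print(slow_queries(sample, threshold_seconds=1.0))
```

Only the second entry exceeds the one-second threshold used here. What counts as acceptable depends on the query and on how directly users wait on it.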

As always, let us know if you think we have left out any important issues that you like to track in your logs. To start tracking your own system performance, create a free account and include the patterns listed above to automatically create tags and alerts relevant to your system.

 

Published at DZone with permission of Trevor Parsons, author and DZone MVB. (source)

http://java.dzone.com/articles/5-ways-use-log-data-analyze?mz=110215-high-perf
