Original article: https://dzone.com/articles/7-java-performance-metrics-to-watch-after-a-major-1

The Java performance metrics you need to follow for understanding how your application behaves in production.

Unlike the days when software was shipped in boxes and there was no way of knowing how it would perform in production, today almost any metric you can think of can be tracked and reported. The problems we're now dealing with come from information overload and scale, rather than from not having enough information. With tens or hundreds of servers at play, keeping track becomes even harder. One thing that remains from those boxed-software days is logs, which have stayed pretty much the same for over 20 years now. Most developers still depend on them for insights into their production systems, but they're gradually being replaced.

For this post we've decided to gather some of the most insightful metrics you can follow to understand how your application behaves in production, WITHOUT relying on log files in any way. Aside from external factors like user load (or… AWS downtime), new deployments are probably the most common influence on how Java performance indicators behave. So keeping an eye on these metrics is most critical during the sensitive period right after a new deployment.

If It Has Numbers in It, Then It Must Be True!

Before we move on to discussing each metric, let's highlight one major caveat. There's a notion that if you back yourself with data, then you must be right. The problem is that it's really easy to misrepresent data, much easier than it is to prove it wrong when it's presented to you. Let's distinguish here between measures that come from looking at simple time-series data, seeing how a basic metric behaves over time, versus looking at the data from a different angle and keeping your performance percentiles in check. The bottom line is that we need to be mindful of the impact of the metrics we care about, and keep some sanity check in place to assess them.

For example, say we're looking at the median / 50th percentile transaction response time, a popular indicator that many companies use as one of their main KPIs. In practice, when a single pageview fires tens of these requests (usually well over 40), the user is 99.999…% likely to experience a result worse than the median (it's simple math: 1 – (0.5 ^ 40)). So which percentile does it make sense to focus on? Even if we're looking at the 95th percentile, since you probably have well over 40 requests per page, most of your users will still experience a response worse than that. Multiply over several pageviews, and it gets even tougher. To read more about percentiles and just how misleading this data can be, check out Gil Tene's blog right here.
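To make the arithmetic concrete, here's a tiny sketch of that percentile math, assuming roughly 40 requests per pageview as in the example above (the exact count is an assumption for illustration):

```java
// Probability that at least one of N requests is worse than a given percentile.
public class PercentileMath {
    public static void main(String[] args) {
        int requestsPerPage = 40; // assumed number of requests per pageview

        double worseThanMedian = 1 - Math.pow(0.50, requestsPerPage);
        double worseThanP95    = 1 - Math.pow(0.95, requestsPerPage);

        System.out.printf("P(worse than median) = %.12f%n", worseThanMedian); // ~0.999999999999
        System.out.printf("P(worse than 95th)   = %.4f%n", worseThanP95);     // ~0.8715
    }
}
```

In other words, even the 95th percentile hides what the majority of your users actually experience on a full pageview.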

Now, let's take a closer look at our metrics of choice, see exactly what they stand for, and how you can get hold of them:

1. Response Times and Throughput

The application response time measures how long it takes for the transactions in your application to complete. It can also be looked at from the HTTP request level, or, say, the database level, allowing you to narrow down on the slowest queries that may need some optimization. The throughput indicator looks at transactions from another angle and shows how many requests your application is processing at any given time, usually per minute (rpm).

One way to measure this is with APMs like New Relic or AppDynamics (which we compared head to head in a previous blog post). With these kinds of tools you can follow the average response time and compare it with yesterday's or last week's straight from the main reporting dashboard, which helps us see how new deployments affect our application's health. Another view lets you look at the web transaction percentiles, measuring how long it takes HTTP requests to complete.

It's also possible to monitor this in-house, but it requires some coding, like instrumenting your code with Dropwizard Metrics and publishing the data to Graphite. The most useful insights, though, come when you correlate this data with other metrics. More on that in the measures we cover next.
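For illustration, here's a minimal sketch of that in-house approach using Dropwizard Metrics with its Graphite reporter. The Graphite host, the metric name and the wrapped transaction are assumptions made for the example, not part of the original setup:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class ResponseTimeMetrics {
    private static final MetricRegistry registry = new MetricRegistry();
    private static final Timer checkoutTimer = registry.timer("transactions.checkout");

    public static void main(String[] args) throws Exception {
        // Ship all registered metrics to Graphite once a minute
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("myapp")
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        reporter.start(1, TimeUnit.MINUTES);

        // Wrap the transaction you want to measure with the timer
        try (Timer.Context ignored = checkoutTimer.time()) {
            processCheckout();
        }
    }

    private static void processCheckout() throws InterruptedException {
        Thread.sleep(120); // placeholder for the real transaction
    }
}
```

A nice side effect of using a Timer is that the reporter ships percentiles (p50, p95, p99) alongside rates, which ties back to the percentile discussion above.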

Takeaway #1: Make sure the collection methods you use allow you to look at the data from different angles and get down to the percentile level.

Tools to check:
1. AppDynamics
2. New Relic
3. Ruxit


Web transaction percentiles and throughput reporting in New Relic

2. Load Average

The second metric we follow closely is the Load Average on our servers. Load Average is traditionally reported as three figures, showing its value over the last 1, 5 and 15 minutes (left to right). As long as the value stays under the number of cores your machine has, you're in the clear; once it goes over the number of cores, it means your machine is under stress.

Beyond the simple measure of CPU utilization, Load Average takes into account how many processes each core has waiting in its queue. A state where a core is 100% utilized but about to finish its current task is quite different from one where it has 6 more tasks queued up. CPU utilization alone doesn't cover that, but load average takes the bigger picture into account.
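If you also want to check this programmatically from inside the JVM (just a sketch; command-line tools like htop, mentioned next, give a far richer picture), the standard management beans expose the 1-minute load average and the core count:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAverageCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cores = Runtime.getRuntime().availableProcessors();
        double loadAverage = os.getSystemLoadAverage(); // 1-minute average; -1 if unavailable

        System.out.printf("Load average (1 min): %.2f over %d cores%n", loadAverage, cores);
        if (loadAverage > cores) {
            System.out.println("Machine is under stress: more runnable tasks than cores.");
        }
    }
}
```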

An awesome way to follow your server's load average on Linux is with htop by Hisham Muhammad. It's a great colorful, live visualization that makes your command line feel much like a NASA dashboard.

Takeaway #2: Utilization of a resource is not enough to determine its load; you also need to be mindful of the tasks waiting in its queue to be fully informed.

Tools to check:
1. htop


Running htop to examine the loads on one of our servers, load average appears on the top right

3. Error Rates (and How to Solve Them)

There are several different ways to look at error rates, and most developers go with the high-level metrics, looking at error rates at the whole-application level: total failed HTTP transactions out of the overall HTTP requests, for instance. But there's an often overlooked, more in-depth layer to this with immediate implications for your application's health: error rates for specific transactions, showing the number of times a certain method in your code fails and produces a logged error or exception out of the overall times it has been called.
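For a rough, do-it-yourself approximation of this per-transaction view, a pair of meters can track calls versus failures. This is only a sketch with hypothetical metric and method names, and it counts failures without capturing the error state itself:

```java
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class TransactionErrorRate {
    private static final MetricRegistry registry = new MetricRegistry();
    private static final Meter calls  = registry.meter("checkout.calls");
    private static final Meter errors = registry.meter("checkout.errors");

    public static void processCheckout() {
        calls.mark();
        try {
            doCheckout();
        } catch (RuntimeException e) {
            errors.mark(); // count the failure for this specific transaction
            throw e;
        }
    }

    private static void doCheckout() {
        // real transaction logic goes here
    }

    // Error rate for this specific transaction, e.g. for a dashboard or an alert
    public static double errorRate() {
        long total = calls.getCount();
        return total == 0 ? 0.0 : (double) errors.getCount() / total;
    }
}
```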


Error rate breakdown by specific events in Takipi, narrowing down on the root cause for a peak in error volume

But this data doesn't mean much on its own, right? The second step, after prioritizing the most urgent events to address, be they logged errors or exceptions, is to get down to their real root cause and fix them. And we've built a solution for this problem as well. With Takipi in play, you don't need to pull up log files and start looking for clues. All the information about the state of the server is accessible from the same screen, including the stack trace, the actual source code, and the variable values, across multiple instances of each faulty call.

Takeaway #3: High level data is not enough to get down to the real root cause of increased error rates. You need to favor collection methods that produce the richest data about the metrics you care about.

Tools to check:
1. Takipi


Zooming on the error analysis down to the specific variables that caused each error

4. GC Rate and Pause Duration

A misbehaving garbage collector is one of the main reasons your application's throughput and response times can take a deep dive. So when digging in to find the cause of these symptoms, a common finding is that the application was in the middle of a stop-the-world GC pause. To learn more about optimizing garbage collection and its associated metrics, check out this post we published on solution strategies for GC issues.

The key to understanding the frequency and duration of GC pauses lies in analyzing the GC log files. It's not a metric you get out of the box: you either analyze the logs yourself or use tools like jClarity. Either way, you'll need to make sure GC log collection is turned on with the appropriate JVM arguments.
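As a starting point on HotSpot, GC logging is usually enabled with flags along the lines of -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps on Java 8, or -Xlog:gc* on Java 9 and later. For a coarse, in-process view of collection counts and accumulated collection time (not individual pause durations, which still require the logs), the JMX garbage collector beans can be polled; a minimal sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // One bean per collector, e.g. young and old generation collectors
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```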

Takeaway #4: Keep a broad view and correlate the data between different metrics to see how they affect each other.

Tools to check:
1. jClarity Censum
2. GCViewer

5. Business Metrics

The performance of your application doesn't depend solely on how fast it responds, nor on its error rate. The flip side is business metrics, and responsibility for those doesn't rest with the product / sales people alone. Measures like revenue, user counts, and interactions with specific areas of your application are critical for understanding how it performs. Having those side by side with the timestamps of new deployments is important for seeing how the fixes and new features you ship affect the bottom line of the business. Hopefully for the better of course, but if things turn for the worse, it's super easy to know what needs fixing once you have all your data in one place.

Moreover, the ability to tie those business metrics, in real time, together with data about error rates and latency is extremely powerful. It allows you to drill down and understand exactly which error or exception is causing you the most trouble, so you can prioritize them by their impact on business goals and make sense of all the exceptions and log errors flying around. The way to do this is to use monitoring tools that are open to integrations and play well with the other kids in the neighborhood. This is why it's super important to keep all data open and have the option to export it to the service of your choice.

Say you're using Graphite to centralize the business metrics you report on; you'll need the tools you use to be able to send data to it. For example, the way our engineering team enabled this was by opening the metrics we report on to publishing through StatsD, so they can then be directed to any reporting dashboard our users choose.
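For reference, the StatsD wire protocol is simple enough that a counter can be pushed with a single UDP packet. Here's a minimal sketch; the host, port and metric name are assumptions, and in practice you'd typically use a StatsD client library instead:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class StatsdCounter {
    public static void main(String[] args) throws Exception {
        // StatsD counter format: "metric.name:value|c"
        byte[] payload = "business.signups:1|c".getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    new InetSocketAddress("statsd.example.com", 8125)));
        }
    }
}
```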

Takeaway #5: Siloed data is a thing of the past. The methods you choose to pull metrics with should also let you correlate them with data from different sources.

Tools to check:
1. Grafana
2. The ELK stack
3. Datadog
4. Librato

6. Uptime and Service Health

This one metric sets the tone for the whole shebang. Beyond using it as an alerting medium, it also lets you define your SLAs over time, showing what percentage of the time you're providing a fully functioning service to your users.

The way we follow up on this is through a health check we run with a single servlet, monitored by Pingdom. The check looks into all of the services that take part in our application's transactions, including the database and S3.
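A minimal sketch of a health-check servlet along those lines is shown below, assuming the standard javax.servlet API; the checkDatabase() and checkS3() helpers are hypothetical stand-ins for real probes of each dependency:

```java
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

public class HealthCheckServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        boolean healthy = checkDatabase() && checkS3();
        if (healthy) {
            resp.setStatus(HttpServletResponse.SC_OK);
            resp.getWriter().write("OK");
        } else {
            // A non-200 status is what the external uptime monitor alerts on
            resp.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            resp.getWriter().write("DEGRADED");
        }
    }

    private boolean checkDatabase() { return true; } // replace with a real connectivity check
    private boolean checkS3()       { return true; } // replace with a real S3 probe
}
```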

Takeaway #6: Uptime might be a binary indicator, but there's a lot of value in looking at it in an aggregate manner to locate the weak spots in your stack.

Tools to check:
1. Pingdom


Monitoring uptime and application health in Pingdom

7. Log Size

All the metrics we've discussed so far skip logging altogether, well, except for GC logs. But we still can't ignore logs entirely. A side effect of logs is that they never stop growing. If you don't keep an eye on their size and on the process you have in place for keeping them in check, bad things can happen. When logs get loose, hard drives cry: your servers start filling up with junk, and everything slows down. So it's important to keep a close eye on them; they're a never-ending source of havoc.

The most popular approach is processing the logs on the server with services like Logstash and shipping them out to storage with Splunk, the ELK stack, other log management tools, or plain storage on S3, for example. Another way is simply to roll them over or truncate them at some point, but then we risk losing information since, like most developers, we haven't cut our dependence on logs just yet.
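As a small example of keeping an eye on them yourself, here's a sketch of a watchdog that sums up a log directory's size and flags it past a threshold; the path and the limit are arbitrary assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LogSizeCheck {
    public static void main(String[] args) throws IOException {
        Path logDir = Paths.get("/var/log/myapp"); // assumed log location
        long limitBytes = 5L * 1024 * 1024 * 1024; // 5 GB threshold

        long totalBytes;
        try (Stream<Path> files = Files.walk(logDir)) {
            totalBytes = files.filter(Files::isRegularFile)
                              .mapToLong(p -> p.toFile().length())
                              .sum();
        }

        System.out.printf("Logs under %s: %.2f GB%n", logDir, totalBytes / 1e9);
        if (totalBytes > limitBytes) {
            System.out.println("Warning: log volume over the limit - time to rotate, ship or trim.");
        }
    }
}
```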

Takeaway #7: Logs are a huge pain, especially since you're charged by the GB if you have an external service take care of them for you. It's time to rethink the problem and start reducing log sizes.

Final Thoughts

We see a trend in how data collection from applications in production is slowly moving away from complete reliance on log files. The new world of software analytics is more open, with smarter data that goes beyond plain numbers and holds rich contextual information. It's exciting to see how it will all turn out, and we look forward to building this new future together with you.
