原文地址:https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications

High-performance applications form the backbone of the modern web. At LinkedIn, a number of internal high-throughput services cater to thousands of user requests per second. For optimal user experience, it is very important to serve these requests with low latency.

For example, a product our members use regularly is the Feed - a constantly updating list of professional activities and content. Examples of various feeds across LinkedIn include those in company pages, school pages, and most importantly - the homepage feed. The underlying feed data platform that indexes these updates from various entities in our economic graph (members, companies, groups, etc.) has to serve relevant updates with high throughput and low latency.

Figure 1: LinkedIn Feeds

For taking these types of high-throughput, low-latency Java applications to production, developers have to ensure consistent performance at every stage of the application development cycle. Determining optimal Garbage Collection (GC) settings is critical to achieve these metrics.

This blog post will walk through the steps to identify and optimize GC requirements, and is intended for a developer interested in a systematic method to tame GC to obtain high throughput and low latency. Insights in this post were gathered while building the next generation feed data platform at LinkedIn. These insights include, but are not limited to, CPU and memory overheads of Concurrent Mark Sweep (CMS) and G1 collectors, avoiding incessant GC cycles due to long-lived objects, performance changes obtained by optimizing task assignment to GC threads, and OS settings that are needed to make GC pauses predictable.

When is the right time to optimize GC?

GC behavior can vary with code-level optimizations and workloads. So it is important to tune GC on a codebase that is near completion and includes performance optimizations. But it is also necessary to perform preliminary analysis on an end-to-end basic prototype with stub computation code and synthetic workloads representative of production environments. This will help capture realistic bounds on latency and throughput based on the architecture and guide the decision on whether to scale up or scale out.

During the prototyping phase of our next generation feed data platform, we implemented almost all of the end-to-end functionality and replayed query workload being served by the current production infrastructure. This gave us enough diversity in the workload characteristics to measure application performance and GC characteristics over a long enough period of time.

Steps to optimize GC

Here are some high level steps to optimize GC for high-throughput, low-latency requirements. Also included, are the details of how this was done for the feed data platform prototype. We saw the best GC performance with ParNew/CMS, though we also experimented with the G1 garbage collector.

1. Understand the basics of GC

Understanding how GC works is important because of the large number of variables that need to be tuned. Oracle's whitepaper on Hotspot JVM Memory Management is an excellent starting point to become familiar with GC algorithms in Hotspot JVM. To understand the theoretical aspects of the G1 collector, check out this paper.

2. Scope out GC requirements

There are certain characteristics of GC that you should optimize, to reduce its overhead on application performance. Like throughput and latency, these GC characteristics should be observed over a long-running test to ensure that the application can handle variance in traffic while going through multiple GC cycles.

  • Stop-the-world collectors pause the application threads to collect garbage. The duration and frequency of these pauses should not adversely impact the application's ability to adhere to the SLA.
  • Concurrent GC algorithms contend with the application threads for CPU cycles. This overhead should not affect the application throughput.
  • Non-compacting GC algorithms can cause heap fragmentation, which leads to long stop-the-world pauses due to full GC. The heap fragmentation should be kept to a minimum.
  • Garbage collection needs memory to work. Certain GC algorithms have a higher memory footprint than others. If the application needs a large heap, make sure the GC's memory overhead is not large.
  • A clear understanding of GC logs and commonly used JVM parameters is necessary to easily tune GC behavior should the code complexity grow or workload characteristics change.

We used Hotspot Java7u51 on Linux OS and started the experiment with 32 GB heap, 6 GB young generation, and a -XX:CMSInitiatingOccupancyFraction value of 70 (percentage of the old generation that is filled when old GC is triggered). The large heap is required to maintain an object cache of long-lived objects. Once this cache is populated, the rate of object promotions to the old generation drops significantly.

With the initial GC configuration, we incurred a young GC pause of 80 ms once every three seconds and 99.9th percentile application latency of 100 ms. The GC behavior that we started with is likely adequate for many applications with less stringent latency SLAs. However, our goal was to decrease the 99.9th percentile application latency as much as possible. GC optimization was essential to achieve this goal.

3. Understand GC metrics

Measurement is always a prerequisite to optimization. Understanding the verbose details of GC logs (with these options: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime) helps with the overall understanding of the application's GC characteristics.

At LinkedIn, our internal monitoring and reporting systems, inGraphs and Naarad, generate useful visualizations for metrics such as GC stall time percentiles, max duration of a stall, and GC frequency over long durations. In addition to Naarad, there are a number of open source tools like gclogviewer that can create visualizations from GC logs.

At this stage, it is possible to determine whether GC frequency and pause duration are impacting the application's ability to meet latency requirements.

4. Reduce GC frequency

In generational GC algorithms, collection frequency for a generation can be decreased by (i) reducing the object allocation/promotion rate and (ii) increasing the size of the generation.

In Hotspot JVM, the duration of young GC pause depends on the number of objects that survive a collection and not on the young generation size itself. The impact of increasing the young generation size on the application performance has to be carefully assessed:

  • Increasing the young generation size may lead to longer young GC pauses if more data survives and gets copied in the survivor spaces, or if more data gets promoted to the old generation in each young collection. Longer GC pauses may lead to an increased application latency and/or reduced throughput.
  • On the other hand, the pause duration may not increase if the number of objects that survive each collection does not increase substantially. In this case, the reduction in GC frequency may lead to a reduction in overall application latency and/or an increase in the throughput.

For applications that mostly create short-lived objects, you will only need to control the aforementioned parameters. For applications that create long-lived objects, there is a caveat; the promoted objects may not be collected in an old generation GC cycle for a long time. If the threshold at which old generation GC is triggered (expressed as a percentage of the old generation that is filled) is low, the application can get stuck in incessant GC cycles. You can avoid this by triggering GC at a higher threshold.

As our application maintains a large cache of long-lived objects in the heap, we increased the threshold of triggering old GC by setting: -XX:CMSInitiatingOccupancyFraction=92 -XX:+UseCMSInitiatingOccupancyOnly. We also tried to increase the young generation size to reduce the young collection frequency, but reverted this change as it increased the 99.9th percentile application latency.

5. Reduce GC pause duration

The young GC pause duration can be reduced by decreasing the young generation size as it may lead to less data being copied in survivor spaces or promoted per collection. However, as previously mentioned, we have to observe the impact of reduced young generation size and the resulting increase in GC frequency on the overall application throughput and latency. The young GC pause duration also depends on tenuring thresholds and the old generation size (as shown in step 6).

With CMS, try to minimize the heap fragmentation and the associated full GC pauses in the old generation collection. You can get this by controlling the object promotion rate and by reducing the -XX:CMSInitiatingOccupancyFraction value to trigger the old GC at a lower threshold. For a detailed understanding of all options that can be tweaked and their associated tradeoffs, check out Tuning Java Garbage Collection for Web Services andJava Garbage Collection Distilled.

We observed that much of our Eden space was evacuated in the young collection and very few objects died in the survivor spaces over the ages three to eight. So we reduced the tenuring threshold from 8 to 2 (with option: -XX:MaxTenuringThreshold=2), to reduce the amount of time spent in data copying in young generation collection.

We also noticed that the young collection pause duration increased as the old generation was filling up; this indicated that the object promotion was taking more time due to backpressure from the old generation. We addressed this problem by increasing the total heap size to 40 GB and reducing -XX:CMSInitiatingOccupancyFraction value to 80 to kick off old collection sooner. Though the -XX:CMSInitiatingOccupancyFraction value was reduced, increasing the heap size ensured that we avoided the incessant old GCs. At this stage, we observed a young collection pause of 70 ms and 99.9th percentile latency of 80 ms.

6. Optimize task assignment to GC worker threads

To reduce the young generation pause duration even further, we decided to look into options that optimized task binding with GC threads.

The -XX:ParGCCardsPerStrideChunk option controls the granularity of tasks given to GC worker threads and helps get the best performance out of a patch written to optimize card table scan time spent in young generation collection. The interesting learning here was that the young GC time increases with the increase in the size of old generation. Setting this option to a value of 32768 brought our young collection pause down to an average of 50 ms. We also achieved a 99.9th percentile application latency of 60 ms.

There are also a couple of other interesting options that deal with mapping tasks to GC threads. -XX:+BindGCTaskThreadsToCPUs option binds GC threads to individual CPU cores if the OS permits it. -XX:+UseGCTaskAffinity allocates tasks to GC worker threads using an affinity parameter. However, our application did not see any benefit from these options. In fact, some investigation revealed that these options are NOPs on Linux systems [1, 2].

7. Identify CPU and memory overhead of GC

Concurrent GC typically increases CPU usage. While we observed that the default settings of CMS behaved well, the increased CPU usage due to concurrent GC work with the G1 collector degraded our application's throughput and latency significantly. G1 also tends to impose a larger memory overhead on the application as compared to CMS. For low-throughput applications that are not CPU-bound, high CPU usage by GC may not be a pressing concern.

Figure 2: CPU usage in % with ParNew/CMS and G1: the node with relatively variable CPU usage used G1 with the option -XX:G1RSetUpdatingPauseTimePercent=20

Figure 3: Requests served per second with ParNew/CMS and G1: the node with lower throughput used G1 with the option -XX:G1RSetUpdatingPauseTimePercent=20

8. Optimize system memory and I/O management for GC

Occasionally, it is possible to see GC pauses with (i) low user time, high system time and high real (wallclock) time or (ii) low user time, low system time and high real time. This indicates problems with the underlying process/OS settings. Case (i) might imply that the JVM pages are being stolen by Linux and case (ii) might imply that the GC threads were recruited by Linux for disk flushes and were stuck in the kernel waiting for I/O. Refer to thisslide deck to check which settings might help in such cases.

We used the JVM option -XX:+AlwaysPreTouch to touch and zero out a page at application startup to avoid the performance penalty at runtime. We also setvm.swappiness to zero so that the OS does not swap pages unless it is absolutely necessary.

You could potentially use mlock to pin the JVM pages to memory so that the OS does not swap them out. However, if the system exhausts all its memory and the swap space, the OS will kill a process to reclaim memory. Typically, Linux kernel picks the process that has a high resident memory footprint but has not been running for long (the workflow of killing a process in case of OOM). In our case, this process will most likely be our application. Graceful degradation is one of the nicer properties of a service and the possibility of the service's sudden demise does not augur well for operability — so we don't use mlock and rely on vm.swappiness to avoid the swap penalty as long as it is possible.

GC optimization for the feed data platform at LinkedIn

For the prototype feed data platform system, we tried to optimize garbage collection using two algorithms with the Hotspot JVM:

  • ParNew for the young generation collection, CMS for the old generation collection.
  • G1 for the young and old generations. G1 tries to solve the problem of having stable and predictable pause time below 0.5 seconds for heap sizes of 6GB or larger. In the course of our experiments with G1, despite tuning various parameters, we could not get the same GC performance or pause predictability as with ParNew/CMS. We also observed a bug [3] related to memory leak while using G1, but we are not sure of the root cause.

With ParNew/CMS, we saw 40-60 ms young generation pauses once every three seconds and a CMS cycle once every hour. We used the following JVM options:

// JVM sizing options
-server -Xms40g -Xmx40g -XX:MaxDirectMemorySize=4096m -XX:PermSize=256m -XX:MaxPermSize=256m
// Young generation options
-XX:NewSize=6g -XX:MaxNewSize=6g -XX:+UseParNewGC -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=8 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768
// Old generation options
-XX:+UseConcMarkSweepGC -XX:CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly
// Other options
-XX:+AlwaysPreTouch -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:-OmitStackTraceInFastThrow

With these options, our application's 99.9th percentile latency reduced to 60 ms while providing a throughput of thousands of read requests.

Acknowledgements

Many hands contributed to the development of the prototype application: Ankit Gupta,Elizabeth BennettRaghu HiremagalurRoshan SumbalySwapnil GhikeTom Chiang, and Vivek Nelamangala. Many thanks to Cuong TranDavid Hoa, and Steven Ihde for their help with system optimizations.

Footnotes

[1] -XX:+BindGCTaskThreadsToCPUs seems to be a NOP on Linux systems because thedistribute_processes method in hotspot/src/os/linux/vm/os_linux.cpp is not implemented in JDK7 or JDK8.

[2] The -XX:+UseGCTaskAffinity option seems to be a NOP on all platforms in JDK7 and JDK8 as the task affinity appears to be always set to sentinel_worker = (uint) -1. The source code is inhotspot/src/share/vm/gc_implementation/parallelScavenge/{gcTaskManager.cpp, gcTaskThread.cpp, gcTaskManager.hpp}.

[3] There have been a few bugs related to memory leaks with G1; it is possible that Java7u51 might have missed some fixes. For example, this bug was fixed only in Java 8.

Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications--转载的更多相关文章

  1. Optimizing web servers for high throughput and low latency

    转自:https://blogs.dropbox.com/tech/2017/09/optimizing-web-servers-for-high-throughput-and-low-latency ...

  2. Java Garbage Collection Basics--转载

    原文地址:http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html Overview Purpose ...

  3. How Garbage Collection Really Works

    Java Memory Management, with its built-in garbage collection, is one of the language's finest achiev ...

  4. [Java] 垃圾回收 ( Garbage Collection ) 的步骤演示

    关于 JVM 垃圾回收机制的基础内容,可参考上一篇博客 垃圾回收机制 ( Garbage Collection ) 简介 上一篇博客,介绍了堆的内存被分为三个部分:年轻代.老年代.永生代.这篇博文将演 ...

  5. 深入理解Java中的Garbage Collection

    前提 最近由于系统业务量比较大,从生产的GC日志(结合Pinpoint)来看,需要对部分系统进行GC调优.但是鉴于以往不是专门做这一块,但是一直都有零散的积累,这里做一个相对全面的总结.本文只针对Ho ...

  6. [转] 分代垃圾回收的 新旧代引用问题(原标题:Back To Basics: Generational Garbage Collection)

    原文链接: https://blogs.msdn.microsoft.com/abhinaba/2009/03/02/back-to-basics-generational-garbage-colle ...

  7. 2.5 – Garbage Collection 自动垃圾回收 Stop-the-world vs. incremental vs. concurrent 垃圾回收策略

    2.5 – Garbage Collection  自动垃圾回收 Lua 5.3 Reference Manual http://www.lua.org/manual/5.3/manual.html# ...

  8. [翻译]Java垃圾收集精粹(Java Garbage Collection Distilled)

    source URL: http://www.infoq.com/articles/Java_Garbage_Collection_Distilled Name: Java Garbage Colle ...

  9. The Impact of Garbage Collection on Application Performance

    As we’ve seen, the performance of the garbage collector is not determined by the number of dead obje ...

随机推荐

  1. react基础课程一简述JSX及目录关系

    简述JSX及目录关系 简述:它被称为JSX,它是JavaScript的语法扩展,JSX是一种模板语言,但它具有JavaScript的全部功能.所以学习jsx还是需要学习基础的javaScript的. ...

  2. mkdi---创建目录。

    mkdir命令用来创建目录.该命令创建由dirname命名的目录.如果在目录名的前面没有加任何路径名,则在当前目录下创建由dirname指定的目录:如果给出了一个已经存在的路径,将会在该目录下创建一个 ...

  3. 【VC++学习笔记一】MFC操作Excel

    最近在做一个读取Excel的功能,之前也做过相关的,但总是零零闪闪的,趁着正在劲头上,归纳一下: 利用Automation添加相关的类,在Excel2010中可以在安装文件夹下直接点击Excel.ex ...

  4. promise的弊端

    promise的弊端 promise彻底解决了callback hell,但也存在以下一些问题 延时问题(涉及到evnet loop)(http://www.ruanyifeng.com/blog/2 ...

  5. Nginx+tomcat+ssl免费证书配置

    0.说明 本文说描写叙述的方式是用nginx的443重定向到tomcat的8443,nginx的80port重定到tomcat的8080: 乱入:个人标记:caicongyang 1.nginx安装 ...

  6. Linux监控(OS,MySQL,Tomcat,Apache)

    关于逐步脱离开发岗位的意见,老大已经批准了,接下来我的主要工作就是"运维+数据库管理".感谢杰民兄和小马哥能接受我的骚扰.接下来还会去骚扰他们,同一时候也会去骚扰董大爷,小刚总,心 ...

  7. 用Shell脚本过滤Hadoop中不能訪问的节点

    近期使用的一个集群hp1,由于维护集群的人不给力.节点总是过一段时间就掉一两个.今天发现重新启动hadoop时,HDFS已经进入保护模式了. 决定把slaves节点中的无法訪问的节点所有过滤掉.所以写 ...

  8. #学习笔记#——JavaScript 数组部分编程(二)

    2.移除数组 arr 中的所有值与 item 相等的元素,直接在给定的 arr 数组上进行操作,并将结果返回 function removeWithoutCopy(arr, item) { if(!A ...

  9. WPF通用管理框架 项目客户端基础结构介绍

    介绍 首先, 粗糙的展示一下目前的结构设计理念, 因为这几天一直在忙于工作, 所以跟进有些缓慢, 整体的设计是支持多种服务模式.目前只针对MSSQL做数据库接口, ORM选型则用的是微软的EF(PS: ...

  10. linux/unix 基本概念的认识(sha-bang 、PPA)

    PPA:Personal Package Archives : Ubuntu: 比如为安装 emacs,需要首先添加某个PPA: sudo add-apt-repository ppa:cassou/ ...