This section provides solutions to some performance problems, and describes configuration best practices.

  Important:

If you are running CDH over 10Gbps Ethernet, improperly set network configuration or improperly applied NIC firmware or drivers can noticeably degrade performance. Work with your network engineers and hardware vendors to make sure that you have the proper NIC firmware, drivers, and configurations in place and that your network performs properly. Cloudera recognizes that network setup and upgrade are challenging problems, and will make best efforts to share any helpful experiences.

Disabling Transparent Hugepage Compaction

Most Linux platforms supported by CDH4 include a feature called transparent hugepage compaction which interacts poorly with Hadoop workloads and can seriously degrade performance.

Symptom: top and other system monitoring tools show a large percentage of the CPU usage classified as "system CPU". If system CPU usage is 30% or more of the total CPU usage, your system may be experiencing this issue.

What to do:

  Note: In the following instructions, defrag_file_pathname depends on your operating system:

  • Red Hat/CentOS: /sys/kernel/mm/redhat_transparent_hugepage/defrag
  • Ubuntu/Debian, OEL, SLES: /sys/kernel/mm/transparent_hugepage/defrag
  1. To see whether transparent hugepage compaction is enabled, run the following command and check the output:

    $ cat defrag_file_pathname
    • [always] never means that transparent hugepage compaction is enabled.
    • always [never] means that transparent hugepage compaction is disabled.
  2. To disable transparent hugepage compaction, add the following command to /etc/rc.local :
     echo never > defrag_file_pathname

You can also disable transparent hugepage compaction interactively (but remember this will not survive a reboot).

To disable transparent hugepage compaction temporarily as root:

# echo 'never' > defrag_file_pathname 

To disable transparent hugepage compaction temporarily using sudo:

$ sudo sh -c "echo 'never' > defrag_file_pathname" 

Setting the vm.swappiness Linux Kernel Parameter

vm.swappiness is a Linux Kernel Parameter that controls how aggressively memory pages are swapped to disk. It can be set to a value between 0-100; the higher the value, the more aggressive the kernel is in seeking out inactive memory pages and swapping them to disk.

You can see what value vm.swappiness is currently set to by looking at /proc/sys/vm; for example:

cat /proc/sys/vm/swappiness

On most systems, it is set to 60 by default. This is not suitable for Hadoop clusters nodes, because it can cause processes to get swapped out even when there is free memory available. This can affect stability and performance, and may cause problems such as lengthy garbage collection pauses for important system daemons. Cloudera recommends that you set this parameter to 0; for example:

# sysctl -w vm.swappiness=0 

Performance Enhancements in Shuffle Handler and IFile Reader

As of CDH4.1, the MapReduce shuffle handler and IFile reader use native Linux calls (posix_fadvise(2) and sync_data_range) on Linux systems with Hadoop native libraries installed. The subsections that follow provide details.

Shuffle Handler

You can improve MapReduce Shuffle Handler Performance by enabling shuffle readahead. This causes the TaskTracker or Node Manager to pre-fetch map output before sending it over the socket to the reducer.

  • To enable this feature for YARN, set the mapreduce.shuffle.manage.os.cache property to true (default). To further tune performance, adjust the value of themapreduce.shuffle.readahead.bytes property. The default value is 4MB.
  • To enable this feature for MRv1, set the mapred.tasktracker.shuffle.fadvise property to true (default). To further tune performance, adjust the value of themapred.tasktracker.shuffle.readahead.bytes property. The default value is 4MB.

IFile Reader

Enabling IFile readahead increases the performance of merge operations. To enable this feature for either MRv1 or YARN, set the mapreduce.ifile.readahead property totrue (default). To further tune the performance, adjust the value of the mapreduce.ifile.readahead.bytes property. The default value is 4MB.

Best Practices for MapReduce Configuration

The configuration settings described below can reduce inherent latencies in MapReduce execution. You set these values in mapred-site.xml.

Send a heartbeat as soon as a task finishes

Set the mapreduce.tasktracker.outofband.heartbeat property to true to let the TaskTracker send an out-of-band heartbeat on task completion to reduce latency; the default value is false:

<property>
<name>mapreduce.tasktracker.outofband.heartbeat</name>
<value>true</value>
</property>

Reduce the interval for JobClient status reports on single node systems

The jobclient.progress.monitor.poll.interval property defines the interval (in milliseconds) at which JobClient reports status to the console and checks for job completion. The default value is 1000 milliseconds; you may want to set this to a lower value to make tests run faster on a single-node cluster. Adjusting this value on a large production cluster may lead to unwanted client-server traffic.

<property>
<name>jobclient.progress.monitor.poll.interval</name>
<value>10</value>
</property>

Tune the JobTracker heartbeat interval

Tuning the minimum interval for the TaskTracker-to-JobTracker heartbeat to a smaller value may improve MapReduce performance on small clusters.

<property>
<name>mapreduce.jobtracker.heartbeat.interval.min</name>
<value>10</value>
</property>

Start MapReduce JVMs immediately

The mapred.reduce.slowstart.completed.maps property specifies the proportion of Map tasks in a job that must be completed before any Reduce tasks are scheduled. For small jobs that require fast turnaround, setting this value to 0 can improve performance; larger values (as high as 50%) may be appropriate for larger jobs.

<property>
<name>mapred.reduce.slowstart.completed.maps</name>
<value>0</value>
</property>

Best practices for HDFS Configuration

This section indicates changes you may want to make in hdfs-site.xml.

Improve Performance for Local Reads

  Note:

Also known as short-circuit local reads, this capability is particularly useful for HBase and Cloudera Impala™. It improves the performance of node-local reads by providing a fast path that is enabled in this case. It requires libhadoop.so (the Hadoop Native Library) to be accessible to both the server and the client.

libhadoop.so is not available if you have installed from a tarball. You must install from an .rpm, .deb, or parcel in order to use short-circuit local reads.

Configure the following properties in hdfs-site.xml as shown:

<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property> <property>
<name> dfs.client.read.shortcircuit.streams.cache.size</name>
<value>1000</value>
</property> <property>
<name> dfs.client.read.shortcircuit.streams.cache.size.expiry.ms</name>
<value>1000</value>
</property> <property>
<name>dfs.domain.socket.path</name>
<value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
  Note:

The text _PORT appears just as shown; you do not need to substitute a number.

If /var/run/hadoop-hdfs/ is group-writable, make sure its group is root.

Tips and Best Practices for Jobs

This section describes changes you can make at the job level.

Use the Distributed Cache to Transfer the Job JAR

Use the distributed cache to transfer the job JAR rather than using the JobConf(Class) constructor and the JobConf.setJar() and JobConf.setJarByClass() method.

To add JARs to the classpath, use -libjars <jar1>,<jar2>, which will copy the local JAR files to HDFS and then use the distributed cache mechanism to make sure they are available on the task nodes and are added to the task classpath.

The advantage of this over JobConf.setJar is that if the JAR is on a task node it won't need to be copied again if a second task from the same job runs on that node, though it will still need to be copied from the launch machine to HDFS.

  Note:

-libjars works only if your MapReduce driver uses ToolRunner. If it doesn't, you would need to use the DistributedCache APIs (Cloudera does not recommend this).

For more information, see item 1 in the blog post How to Include Third-Party Libraries in Your MapReduce Job.

Changing the Logging Level on a Job (MRv1)

You can change the logging level for an individual job. You do this by setting the following properties in the job configuration (JobConf):

  • mapreduce.map.log.level
  • mapreduce.reduce.log.level

Valid values are NONE, INFO, WARN, DEBUG, TRACE, and ALL.

Example:

JobConf conf = new JobConf();
... conf.set("mapreduce.map.log.level", "DEBUG");
conf.set("mapreduce.reduce.log.level", "TRACE");
...

Improving Performance【转】的更多相关文章

  1. TIPS FOR IMPROVING PERFORMANCE OF KAFKA PRODUCER

    When we are talking about performance of Kafka Producer, we are really talking about two different t ...

  2. R12: Improving Performance of General Ledger and Journal Import (Doc ID 858725.1 )

    In this Document   Purpose   Scope   Details   A) Database Init.ora Parameters   B) Concurrent Progr ...

  3. MySQL Crash Course #21# Chapter 29.30. Database Maintenance & Improving Performance

    终于结束这本书了,最后两章的内容在官方文档中都有详细介绍,简单过一遍.. 首先是数据备份,最简单直接的就是用 mysql 的内置工具 mysqldump MySQL 8.0 Reference Man ...

  4. Chapter 6 — Improving ASP.NET Performance

    https://msdn.microsoft.com/en-us/library/ff647787.aspx Retired Content This content is outdated and ...

  5. 提高神经网络的学习方式Improving the way neural networks learn

    When a golf player is first learning to play golf, they usually spend most of their time developing ...

  6. PatentTips - Optimizing Write Combining Performance

    BACKGROUND OF THE INVENTION The use of a cache memory with a processor facilitates the reduction of ...

  7. kafka性能参数和压力测试揭秘

    转自:http://blog.csdn.net/stark_summer/article/details/50203133 上一篇文章介绍了Kafka在设计上是如何来保证高时效.大吞吐量的,主要的内容 ...

  8. neo4j-jersey分嵌入式和服务式连接图形数据库

    原文载自:http://blog.csdn.net/yidian815/article/details/12887259 嵌入式: 引入neo4j依赖 <dependency> <g ...

  9. VBA 获取Sheet最大行

    compared all possibilities with a long test sheet: 0,140625 sec for lastrow = calcws.Cells.Find(&quo ...

随机推荐

  1. 关于js中的回调函数callback

    来源于:http://www.jianshu.com/p/6bc353e5f7a3 前言 其实我一直很困惑关于js 中的callback,困惑的原因是,学习中这块看的资料少,但是平时又经常见,偶尔复制 ...

  2. JavaScript Cookies,创建,获取cookies value

    什么是cookie? cookie 是存储于访问者的计算机中的变量.每当同一台计算机通过浏览器请求某个页面时,就会发送这个 cookie.你可以使用 JavaScript 来创建和取回 cookie ...

  3. Android----Thread+Handler 线程 消息循环(转载)

    近来找了一些关于android线程间通信的资料,整理学习了一下,并制作了一个简单的例子. andriod提供了 Handler 和 Looper 来满足线程间的通信.例如一个子线程从网络上下载了一副图 ...

  4. sqlserver学习笔记(三)—— 为数据库添加新的用户

    首先,用windows或sa身份登录sqlserver, 打开安全性——登录名——右键新建登录名:在选择页——常规中,新建命为user_b的登录名,选择sqlserver身份验证方式,设置密码确认密码 ...

  5. hibernate 注解 boolean问题解决方案

    1.JPA本身是不支持boolean.可以用Hibernater自带的标签.修改如下. @Column(name = "manager_log") @org.hibernate.a ...

  6. 反射已经"Out",动态编译才能"Hold"住

    Net支持反射功能以后,确实使我们Net程序眼前一亮啊,真是太神奇了,只需要传入字符串就可以完成功能.可以说,反射功能的引入,使我们在处理某些问题上更加得心应手. 传统的Db管理软件中,数据库字段的频 ...

  7. HDU2594 Simpsons’ Hidden Talents 【KMP】

    Simpsons' Hidden Talents Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 32768/32768 K (Java ...

  8. java 传址或传值

    原文链接: http://blog.csdn.net/jdluojing/article/details/6962893 java是传值还是传址,这个问题已经讨论了很久了,有些人说是传值的,有些人说要 ...

  9. (面试题)如何查找Oracle数据库中的重复记录

    今天做了个面试题:查找Oracle数据库中的重复记录,下面详细介绍其他方法(参考其他资料) 本文介绍了几种快速查找ORACLE数据库中的重复记录的方法. 下面以表table_name为例,介绍三种不同 ...

  10. iphone5刷机教程

    如果不想麻烦可以在越狱之后添加源,cydia.china3gpp.com打ios7的补丁就可以了 机器为iphone5 美国sprint有锁版 1. 首先备份需要的程序和数据(把各种缓存的影片删掉再备 ...