原文:http://java-performance.info/changes-to-string-java-1-7-0_06/

作者:Mikhail Vorontsov

IMPORTANT: Java 7 update 17 still has no changes in the subject of this article.
一直到jdk7u17,本文所说的仍然适用。
Sharing an underlying char[]
共用底层的char[]
An original String implementation has 4 non static field: 
char[] value with string characters, 
int offset and int count with an index of the first character to use from value and a number of characters to use and 
int hash with a cached value of a String hash code. 
As you can see, in a very large number of cases a String will have offset = 0 and count = value.length. 
The only exception to this rule were the strings created via String.substring calls and all API calls using this method internally (like Pattern.split).
String.substring created a String, which shared an internal char[] value with an original String, which allowed you:
To save some memory by sharing character data
To run String.substring in a constant time ( O(1) )
At the same time such feature was a source of a possible memory leak: 
if you extract a tiny substring from an original huge string and discard that original string, 
you will still have a live reference to the underlying huge char[] value taken from an original String. 
The only way to avoid it was to call a new String( String ) constructor on such string – 
it made a copy of a required section of underlying char[], thus unlinking your shorter string from its longer “parent”.
From Java 1.7.0_06 (as well as in early versions of Java 8) offset and count fields were removed from a String. 
This means that you can’t share a part of an underlying char[] value anymore. 
Now you can forget about a memory leak described above and never ever use new String(String) constructor anymore. 
As a drawback, you now have to remember that String.substring has now a linear complexity instead of a constant one.
原来String的实现有4个非static的字段:
char[]字符数组
int offset(数组里面第一个有效字符的偏移量)+int count(有效字符的个数)
int hash:缓存了hash值
实际上,很大一部分的String的offset是0,count是字符数组的长度,唯一例外的是,调用String.substring或者其他内部调用了这个方法的api生成出来的字符串(比如:Pattern.split)。
String.substring会创建一个String,新创建出来的String会和之前的String底层会共享同一个char[],这就允许:
(1)通过共享字符数组来节省内存
(2)String.substring的时间复杂度是O(1)
但是,这也带来了潜在的内存泄露的风险:
如果你从一个非常巨大的字符串里面截取一个很小的子串,就算把原始的大串丢弃掉,你仍然持有原始字符串的char[]数组的活的引用。
唯一避免这个问题的方式是在截取出来的字符串上调用new String( String ),这会对底层char[]的有效部分做拷贝,就把子串和原始的长串分离了。
从1.7.0_06(一直到早期的jdk8),String内部都移除了offset和count,这就意味着,你无法再像以前和一样在多个字符串之间共享char[]数组。
现在你可以忘掉刚才描述的内存泄露这件事了,也不用再使用new String(String)这个构造函数了。
但是,很遗憾,现在String.substring的时间复杂度变成线性的而不再是常量了。

Changes to hashing logic
hash算法逻辑的改变
There is another change introduced to String class in the same update: a new hashing algorithm. 
Oracle suggests that a new algorithm gives a better distribution of hash codes, 
which should improve performance of several hash-based collections: HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet, WeakHashMap and ConcurrentHashMap.
Unlike changes from the first part of this article, these changes are experimental and turned off by default.
String里面还有另一个改变:一个新的hash算法。
oracle说一个新的hash算法会提供更好的hashcode分布,这会改善许多基于hashcode的集合的性能,比如:HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet, WeakHashMap and ConcurrentHashMap。
不像上面说的那个改变,这个改变还处在试验阶段,默认是关闭的。
As you may guess, these changes are only for String keys. If you want to turn them on, 
you’ll have to set a jdk.map.althashing.threshold system property to a non-negative value (it is equal to -1 by default). 
This value will be a collection size threshold, after which a new hashing method will be used. 
A small remark here: hashing method will be changed on rehashing only (when there is no more free space). 
So, if a collection was rehashed last time at size = 160 and jdk.map.althashing.threshold = 200, 
then a method will only be changed when your collection will grow to size of 320 (approximately).
正如你猜测的那样,这些改变只是对String当键值的有影响。如果你想启用这个特性,你需要设置jdk.map.althashing.threshold这个系统属性的值为一个非负数(默认是-1)
这个值代表了一个集合大小的threshold,超过这个值,就会使用新的hash算法。
需要注意的一点,只有当re-hash的时候,新的hash算法才会起作用。
因此,如果集合上次rehash的时候的大小是160,jdk.map.althashing.threshold的值是200,那么,只有当集合的大小大概在320的时候,才会使用新的hash算法。

String now has a hash32() method, which result is cached in int hash32 field. 
The biggest difference of this method is that the result of hash32() on the same string may be different on various JVM runs 
(actually, it will be different in most cases, because it uses a single System.currentTimeMillis() and two System.nanoTime calls for seed initialization). 
As a result, iteration order on some of your collections will be different each time you run your program.
现在的String有一个hash32()方法,它的值被缓存在hash32这个字段里面(jdk8不是这样的)
这个方法最大的不同是对同一个字符串,在不同的jvm上的hash32()的结果可能会不一样
(实际上大多数情况下都不一样,因为它使用了一个System.currentTimeMillis()和两个System.nanoTime来初始化种子)
结果,每次执行程序的时候,对同一个集合的迭代顺序都是不一样的。

Actually, I was a little surprised by this method. Why do we need it if an original hashCode method works very good? 
I decided to try a test program from hashCode method performance tuning article in order to find out 
how many duplicate hash codes we will have with a hash32 method.
实际上我对这个方法感到有点吃惊。如果之前的hashcode工作的好好的,为啥要改呢?
我决定做一个实验,来找出使用了hash32()以后,会有多少重复的hash值。

String.hash32() method is not public, so I had to take a look at a HashMap implementation in order to find out how to call this method. 
The answer is sun.misc.Hashing.stringHash32(String).
String.hash32()不是public的,所以我不得不查看了下HashMap的内部实现,来找到怎么调用这个方法,答案就是sun.misc.Hashing.stringHash32(String)。

The same test on 1 million distinct keys has shown 304 duplicate hash values, compared to no duplicates while using String.hashCode. 
I think, we need to wait for further improvements or use case descriptions from Oracle.

同样的测试100万个不同的键值,hash32()有304个重复的hashcode,但是String.hashCode却没有一个重复的。
我认为,我们得等oracle做进一步的优化才可以使用。

New hashing may severely affect highly multithreaded code
Oracle has made a bug in the implementation of hashing in the following classes: HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet and WeakHashMap. 
Only ConcurrentHashMap is not affected. The problem is that all non-concurrent classes now have the following field:
新的hash算法可能会严重影响多线程的代码。
oracle对HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet and WeakHashMap这几个类的hash算法的实现有bug,只有ConcurrentHashMap没有bug。
原因在于所有的这几个非同步的类都有一个字段:
/**
 * A randomizing value associated with this instance that is applied to
 * hash code of keys to make hash collisions harder to find.
 * 和本实例相关联的一个随机数,用来计算key的hash值,它能减少hash碰撞。
 */
transient final int hashSeed = sun.misc.Hashing.randomHashSeed(this);

This means that for every created map/set instance sun.misc.Hashing.randomHashSeed method will be called. randomHashSeed, 
in turn, calls java.util.Random.nextInt method. Random class is well known for its multithreaded unfriendliness: 
it contains private final AtomicLong seed field. Atomics behave well under low to medium contention, but work extremely bad under heavy contention.
这意味着,每一个map/set的实例,都会调用sun.misc.Hashing.randomHashSeed()方法。randomHashSeed会调用java.util.Random.nextInt().
Random这个类是多线程不友好的,因为它含有private final AtomicLong的字段作为种子。在竞争少的情况下Atomics工作的很好,但是,竞争大的情况下性能极差。

As a result, many highly loaded web applications processing HTTP/JSON/XML requests may be affected by this bug, 
because nearly all known parsers use one of the affected classes for “name-value” representation. 
All these format parsers may create nested maps, which further increases the number of maps created per second.
结果,大多数处理HTTP/JSON/XML这样的请求的web应用都会被这个bug所影响。
因为,几乎所有能叫得出口的解析器都会使用受影响的这几个类来表示name-value。
所有这些解析器内部可能都会创建嵌套的map,这会进一步增加每秒钟创建的map的数量。

How to solve this problem?
怎么解决这个问题呢:
1. ConcurrentHashMap way: call randomHashSeed method only when jdk.map.althashing.threshold system property was defined. 
Unfortunately, it is available only for JDK core developers.
1.向ConcurrentHashMap那样:只有当jdk.map.althashing.threshold系统属性定义的时候才调用randomHashSeed()方法,很不幸,只有jdk的核心开发者才能这么干。
/**
 * A randomizing value associated with this instance that is applied to
 * hash code of keys to make hash collisions harder to find.
 */
private transient final int hashSeed = randomHashSeed(this);
 
private static int randomHashSeed(ConcurrentHashMap instance) {
    if (sun.misc.VM.isBooted() && Holder.ALTERNATIVE_HASHING) {
        return sun.misc.Hashing.randomHashSeed(instance);
    }
 
    return 0;
}
2. Hacker way: fix sun.misc.Hashing class. Highly not recommended. If you still wish to go ahead, here is an idea: java.util.Random class is not final. 
You can extend it and override its nextInt method, returning something thread-local (a constant, for example). 
Then you will have to update sun.misc.Hashing.Holder.SEED_MAKER field – set it to your extended Random class instance. 
Don’t worry that this field is private, static and final – reflection can help you:
2.手动打补丁:修改sun.misc.Hashing这个类。非常的不推荐。如果你想这么干,可以这样:java.util.Random类不是final的,你可以继承这个类,覆盖里面的nextInt()方法,
返回thread-local的值(甚至可以是常量)。然后修改sun.misc.Hashing.Holder.SEED_MAKER这个字段,让它成为你继承的Random类的实例。
别担心这个字段是private static final的,用反射来搞定:
public class Hashing {
    private static class Holder {
        static final java.util.Random SEED_MAKER;
3. Buddhist way – do not upgrade to Java 7u6 and higher. Check new Java 7 update sources for this bug fix. 
Unfortunately, nothing has changed even in Java 7u17…
3.干脆不要升级到jdk7u6和更高版本,检查java的更新版本看看是否修复了这个bug。
Summary
总结
From Java 1.7.0_06 String.substring always creates a new underlying char[] value for every String it creates. 
This means that this method now has a linear complexity compared to previous constant complexity. 
The advantage of this change is a slightly smaller memory footprint of a String (4 bytes less than before) 
and a guarantee to avoid memory leaks caused by String.substring.
Starting from the same Java update, String class got a second hashing method called hash32. 
This method is currently not public and could be accessed without reflection only via sun.misc.Hashing.stringHash32(String) call. 
This method is used by 7 JDK hash-based collections if their size will exceed jdk.map.althashing.threshold system property.
This is an experimental function and currently I don’t recommend using it in your code.
Starting from Java 1.7.0_06 all standard JDK non-concurrent maps and sets are affected by a performance bug caused by new hashing implementation. 
This bug affects only multithreaded applications creating heaps of maps per second. See this article for more details.
从jdk1.7.0_06开始,String.substring()总会创建一个新的字符串,包括底层的char[]也是新的。
也就是说,这个方法现在的时间复杂度是线性的,而之前是常量。
这个而改变的好处是每一个string占的内存字节数减少了4个字节,而且还避免了因为调用String.substring导致的内存泄露。
从这个版本开始,String类有了第二个hash方法,叫做hash32()
这个方法暂时还不是公开的,如果不用反射,只能sun.misc.Hashing.stringHash32(String)才能调用。这个方法用在jdk7的使用hash值的集合类上,如果集合的大小超过了jdk.map.althashing.threshold这个系统变量的值。
这个方法现在还处于试验阶段,当前不推荐使用。
从Java 1.7.0_06开始,因为新的hash算法实现的bug,所有标准的jdk的非同步的map和set的性能都会受影响。这个bug只会影响每秒创建大量map的多线程的应用。
现在的jdk8中有两个成员:
private final char value[];
private int hash;
hashcode()算法和jdk6中是一样的:
jdk8:
public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
jdk6:
public int hashCode() {
int h = hash;
        int len = count;
if (h == 0 && len > 0) {
   int off = offset;
   char val[] = value;

for (int i = 0; i < len; i++) {
                h = 31*h + val[off++];
            }
            hash = h;
        }
        return h;
    }
参考这个:http://www.iteye.com/topic/626801,String.substring()导致内存泄露,写得非常好!

java.lang.String内部结构的变化的更多相关文章

  1. java.lang.String小测试

    还记得java.lang.String么,如果现在给你一个小程序,你能说出它的结果么 public static String ab(String a){ return a + "b&quo ...

  2. JDK1.8源码(三)——java.lang.String 类

    String 类也是java.lang 包下的一个类,算是日常编码中最常用的一个类了,那么本篇博客就来详细的介绍 String 类. 1.String 类的定义 public final class ...

  3. java.lang.String.getBytes(String charsetName)方法实例

    java.lang.String.getBytes(String charsetName) 方法编码将此String使用指定的字符集的字节序列,并将结果存储到一个新的字节数组. 声明 以下是java. ...

  4. hibernate报错Unknown integral data type for ids : java.lang.String

    package com.model; // Generated 2016-10-27 14:02:17 by Hibernate Tools 4.3.1.Final /** * CmDept gene ...

  5. 前台传参数时间类型不匹配:type 'java.lang.String' to required type 'java.util.Date' for property 'createDate'

    springMVC action接收参数: org.springframework.validation.BindException: org.springframework.validation.B ...

  6. 记录maven java.lang.String cannot be cast to XX error

    在项目开发中自定义了一个maven plugin,在本地能够很好的工作,但是在ci server上却无法正常工作报错为: --------------------------------------- ...

  7. javax.el.PropertyNotFoundException: Property 'name' not found on type java.lang.String

    javax.el.PropertyNotFoundException: Property 'name' not found on type java.lang.String javax.el.Bean ...

  8. Cannot convert value of type [java.lang.String] to required type [java.util.Date] for property 'xxx': no matching editors or conversion strategy found

    今天在完成项目的时候遇到了下面的异常信息: 04-Aug-2014 15:49:27.894 SEVERE [http-apr-8080-exec-5] org.apache.catalina.cor ...

  9. [Ljava.lang.String和java.lang.String区别

    在做项目时报了一个got class [Ljava.lang.String的提示,当时看到[Ljava.lang.String这个时,感觉有点怪怪的,第一次遇到这种情况.最后在网上查了下才明白.是数组 ...

随机推荐

  1. jqery基础知识

    选择器按属性,子元素,表单匹配元素 <!doctype html> <html lang="en"> <head> <meta chars ...

  2. 解读Spring Ioc容器设计图

    在Spring Ioc容器的设计中,有俩个主要的容器系列:一个是实现BeanFactory接口的简单容器系列,这系列容器只实现了容器最基本的功能:另外一个是ApplicationContext应用上下 ...

  3. UNIX线程之间的关系

    我们在一个线程中经常会创建另外的新线程,如果主线程退出,会不会影响它所创建的新线程呢?下面就来讨论一下. 1.  主线程等待新线程先结束退出,主线程后退出.正常执行. 示例代码: #include & ...

  4. static与get属性的作用

    一.Static 用于没有属性的类中,不用保存属性的值,例如 var user=new User(): user.Name="jack" 可以直接调用类中的方法,避免需要多次访问该 ...

  5. flexpaper 开源轻量级的在浏览器上显示各种文档的组件

    FlexPaper是一个开源轻量级的在浏览器上显示各种文档的组件,被设计用来与PDF2SWF一起使用, 使在Flex中显示PDF成为可能,而这个过程并无需PDF软件环境的支持.它可以被当做Flex的库 ...

  6. CSS边框属性二---border-images

    border-images 属性 主要用border-images 属性来制作自适应按钮和tab标签&自适应边框. 例子: border-images:url("img.png&qu ...

  7. STUN/TURN/ICE协议在P2P SIP中的应用(一)

    1           说明 本文详细描述了基于STUN系列协议实现的P2P SIP电话过程,其中涉及到了SIP信令的交互,P2P的原理,以及STUN.TURN.ICE的协议交互 本文所提到的各个服务 ...

  8. HttpUtility.HtmlEncode

    HttpUtility.HtmlEncode用来防止站点受到恶意脚本注入的攻击 public string Welcome(string name, int numTimes = 1) {     r ...

  9. Oracle数据库小知识,改数据库数据

    在一张表上面右键-->查询数据,会生成sql语句,表后面带有t,表示模糊查询, 后面跟上for update之后,执行语句-->小锁(编辑数据)就可以修改数据里面的数据了,修改之后--&g ...

  10. 再看JavaScript线程

    继上篇讨论了一些关于JavaScript线程的知识,我们不妨回过头再看看,是不是JavaScript就不能多线程呢?看下面一段很简单的代码(演示用,没考虑兼容问题): 代码判断一: <div i ...