java过滤四字节和六字节特殊字符
java7版本中可以这样写:
source.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "*");
java6和java7版本中可以这样写:
source.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "*");
Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.
This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.
Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.
Behind the scene
Assuming that you are running your regex on Oracle's implementation, your regex
"([\ud800-\udbff\udc00-\udfff])"
is compiled as such:
StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
Pattern.union. A ∪ B:
Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
BitClass. Match any of these 1 character(s):
[U+002D]
SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match
The character class is parsed as \ud800-\udbff\udc00
, -
, \udfff
. Since \udbff\udc00
forms a valid surrogate pairs, it represent the code point U+10FC00.
Wrong solution
There is no point in writing:
"[\ud800-\udbff][\udc00-\udfff]"
Since Oracle's implementation matches by code point, and valid surrogate pairs will be converted to code point before matching, the regex above can't match anything, since it is searching for 2 consecutive lone surrogate which can form a valid pair.
Solution
If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:
input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
This solution has been tested to work in Java 6 and 7 (Oracle implementation).
The regex above compiles to:
StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match
Note that I am specifying the characters with string literal Unicode escape sequence, and not the escape sequence in regex syntax.
// Only works in Java 7
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "")
Java 6 doesn't recognize surrogate pairs when it is specified with regex syntax, so the regex recognize \\ud800
as one character and tries to compile the range \\udc00-\\udbff
where it fails. We are lucky that it throws an Exception for this input; otherwise, the error will go undetected. Java 7 parses this regex correctly and compiles to the same structure as above.
From Java 7 and above, the syntax \x{h..h}
has been added to support specifying characters beyond BMP (Basic Multilingual Plane) and it is the recommended method to specify characters in astral planes.
input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");
This regex also compiles to the same structure as above.
本文转自:http://stackoverflow.com/questions/27820971/why-a-surrogate-java-regexp-finds-hypen-minus
java过滤四字节和六字节特殊字符的更多相关文章
- Java中char占用几个字节
在讨论这个问题之前,我们需要先区分unicode和UTF. unicode :统一的字符编号,仅仅提供字符与编号间映射.符号数量在不断增加,已超百万.详细:[https://zh.wikipedia. ...
- 《深入理解Java虚拟机》学习笔记之字节码执行引擎
Java虚拟机的执行引擎不管是解释执行还是编译执行,根据概念模型都具有统一的外观:输入的是字节码文件,处理过程是字节码解析的等效过程,输出的是执行结果. 运行时栈帧结构 栈帧(Stack Frame) ...
- Java基础-IO流对象之字节缓冲流(BufferedOutputStream与BufferedInputStream)
Java基础-IO流对象之字节缓冲流(BufferedOutputStream与BufferedInputStream) 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 在我们学习字 ...
- 关于java中char占几个字节,汉字占几个字节
我们平常说,java中char占2个字节,可又说汉字在不通的编码格式中所占的位数是不同的,比如gbk中汉字占2个字节,utf8中多数占3个字节,少数占4个.而所有汉字在java程序中我们都可以简单的用 ...
- java动态代理——字段和方法字节码的基础结构及Proxy源码分析三
前文地址:https://www.cnblogs.com/tera/p/13280547.html 本系列文章主要是博主在学习spring aop的过程中了解到其使用了java动态代理,本着究根问底的 ...
- 工作三年终于社招进字节跳动!字节跳动,阿里,腾讯Java岗面试经验汇总
前言 我大概我是从去年12月份开始看书学习,到今年的6月份,一直学到看大家的面经基本上百分之90以上都会,我就在5月份开始投简历,边面试边补充基础知识等.也是有些辛苦.终于是在前不久拿到了字节跳动的o ...
- Java的IO操作中有面向字节(Byte)和面向字符(Character)两种方式
解析:Java的IO操作中有面向字节(Byte)和面向字符(Character)两种方式.面向字节的操作为以8位为单位对二进制的数据进行操作,对数据不进行转换,这些类都是InputStream和Out ...
- “全栈2019”Java第四十六章:继承与字段
难度 初级 学习时间 10分钟 适合人群 零基础 开发语言 Java 开发环境 JDK v11 IntelliJ IDEA v2018.3 文章原文链接 "全栈2019"Java第 ...
- Java过滤特殊字符
Java正则表达式过滤 1.Java过滤特殊字符的正则表达式----转载 java过滤特殊字符的正则表达式[转载] 2010-08-05 11:06 Java过滤特殊字符的正则表达式 关键字: j ...
随机推荐
- [CF877F]Ann and Books
题目大意: 有$n(n\le10^5)$个数$w_{1\sim n}(|w_i|\le10^9)$,并给定一个数$k(|k|\le10^9)$.$q(q\le10^5)$次询问,每次询问区间$[l,r ...
- 【MySQL笔记】数据定义语言DDL
1.创建基本表 create table <表名> (<列名><数据类型>[列级完整性约束条件] ...
- Java读取文本文件
try { // 防止文件建立或读取失败,用catch捕捉错误并打印,也可以throw StringBuilder stringBuilder = new StringBuilder(); // 读入 ...
- Web服务器在外网能裸奔多久?
很多时候我们轻易地把Web服务器暴露在公网上,查看一下访问日志,可以看到会收到大量的攻击请求,这个是网站开通后几个小时收到的请求: 1. 探测服务器信息 在上线一分钟,收到OPTION请求探测. ...
- 客户端连接Redis
首先下载Jedis http://mvnrepository.com/artifact/redis.clients/jedis 然后脚本如下: package redistest; import ja ...
- Memcached网络模型
之前用libevent开发了一个流媒体服务器.用线程池实现的.之后又看了memcached的网络相关实现,今天来整理一下memcached的实现流程. memcached不同于Redis的单进程单线程 ...
- nagios学习记录
这几天开始接触nagios,记录下学习的心得 监控机上需要安装nagios,nagios-plugins, nrpe 被监控机上需要安装nagios-plugins, nrpe nagios通过插件n ...
- 深入理解Hadoop集群和网络【转】
http://os.51cto.com/art/201211/364374.htm 本文将着重于讨论Hadoop集群的体系结构和方法,及它如何与网络和服务器基础设施的关系.最开始我们先学习一下Hado ...
- Vue基础知识总结(二)
一.解决网速慢的时候用户看到花括号标记 (1)v-cloak,防止闪烁,通常用于比较大的选择器上. 给元素添加属性v-cloak,然后style里面:[v-cloak]{display:none;} ...
- 【Hadoop】HDFS客户端开发示例
1.原理.步骤 2.HDFS客户端示例代码 package com.ares.hadoop.hdfs; import java.io.FileInputStream; import java.io.F ...