Filtering four-byte and six-byte special characters in Java
In Java 7 you can write:
source.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "*");
In both Java 6 and Java 7 you can write:
source.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "*");
Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.
This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.
Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.
Behind the scenes
Assuming that you are running your regex on Oracle's implementation, your regex
"([\ud800-\udbff\udc00-\udfff])"
is compiled as such:
StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
Pattern.union. A ∪ B:
Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
BitClass. Match any of these 1 character(s):
[U+002D]
SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match
The character class is parsed as three items: the range \ud800-\udbff\udc00, the literal -, and the single character \udfff. Since \udbff\udc00 forms a valid surrogate pair, it represents the code point U+10FC00.
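The pairing described above can be checked directly. The following is a minimal sketch, assuming an Oracle/OpenJDK runtime; the class and method names are made up for illustration. It combines \udbff and \udc00 with Character.toCodePoint and then asks whether the character class from the question finds a plain hyphen-minus.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original answer.
final class SurrogateRangeDemo {
    // Combine a high and a low surrogate into the code point they encode.
    static int pairToCodePoint(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    // Does the character class from the question find a plain hyphen-minus?
    static boolean classFindsHyphen() {
        Pattern p = Pattern.compile("([\ud800-\udbff\udc00-\udfff])");
        return p.matcher("-").find();
    }

    public static void main(String[] args) {
        // The high surrogate DBFF and the low surrogate DC00 pair up into one code point.
        System.out.printf("U+%X%n", pairToCodePoint('\udbff', '\udc00')); // U+10FC00
        System.out.println(classFindsHyphen());
    }
}
```

If the class is parsed as described, the second line printed should be true, because the second hyphen ends up as a literal member of the class.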
Wrong solution
There is no point in writing:
"[\ud800-\udbff][\udc00-\udfff]"
Oracle's implementation matches by code point, and a valid surrogate pair is converted to a single code point before matching. The regex above therefore can't match anything: it searches for two consecutive lone surrogates that would form a valid pair, and such a sequence is always read as one code point instead.
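To see this behaviour for yourself, here is a small sketch, again assuming an Oracle/OpenJDK runtime, with a hypothetical class name. It runs the two-class regex against a string holding exactly one valid surrogate pair (U+1F600).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original answer.
final class TwoClassDemo {
    // Try to find "high surrogate followed by low surrogate" as two separate classes.
    static boolean twoClassesFind(String input) {
        return Pattern.compile("[\ud800-\udbff][\udc00-\udfff]").matcher(input).find();
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE00"; // one code point, U+1F600, stored as two chars
        System.out.println(smiley.codePointCount(0, smiley.length())); // 1
        System.out.println(twoClassesFind(smiley));
    }
}
```

On the implementations discussed here, the second line should print false, since the matcher reads the pair as the single code point U+1F600, which is in neither class.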
Solution
If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:
input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
This solution has been tested to work in Java 6 and 7 (Oracle implementation).
The regex above compiles to:
StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match
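A short usage sketch of the solution regex follows. The class name, method name, and sample inputs are assumptions made for illustration. It strips an astral code point and a lone surrogate while leaving BMP text, including a hyphen, untouched.

```java
// Hypothetical helper class; the names are illustrative, not from the original answer.
final class AstralFilter {
    // Astral code points (U+10000..U+10FFFF) plus lone surrogates (U+D800..U+DFFF),
    // written with string-literal escapes so that it also works on Java 6.
    private static final String ASTRAL_OR_SURROGATE =
            "[\ud800\udc00-\udbff\udfff\ud800-\udfff]";

    static String strip(String input) {
        return input.replaceAll(ASTRAL_OR_SURROGATE, "");
    }

    public static void main(String[] args) {
        String withAstral = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        String withLone = "a\uD83Db";         // 'a', lone high surrogate, 'b'
        System.out.println(strip(withAstral)); // ab
        System.out.println(strip(withLone));   // ab
        System.out.println(strip("a-b"));      // a-b (the hyphen is not in the class)
    }
}
```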
Note that I am specifying the characters with string-literal Unicode escape sequences, not with escape sequences in regex syntax such as the following:
// Only works in Java 7
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "")
Java 6 doesn't recognize surrogate pairs when they are specified with regex syntax, so the regex treats \\ud800 as one character and tries to compile the range \\udc00-\\udbff, where it fails. We are lucky that it throws an exception for this input; otherwise, the error would go undetected. Java 7 parses this regex correctly and compiles it to the same structure as above.
From Java 7 onward, the syntax \x{h..h} has been added to support specifying characters beyond the BMP (Basic Multilingual Plane), and it is the recommended way to specify characters in astral planes.
input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");
This regex also compiles to the same structure as above.
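As a quick cross-check of the three spellings discussed here, the sketch below runs each one on the same inputs. It assumes Java 7 or later, since two of the spellings are not accepted by Java 6. The class name and sample strings are hypothetical.

```java
// Hypothetical comparison class; the names are illustrative, not from the original answer.
final class SpellingComparison {
    // String-literal escapes: the compiler puts the actual characters into the pattern.
    static final String LITERAL = "[\ud800\udc00-\udbff\udfff\ud800-\udfff]";
    // Regex-syntax escapes: the regex engine itself pairs up the surrogates (Java 7+).
    static final String REGEX_ESCAPES = "[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]";
    // Code point syntax for the astral range (Java 7+), literal escapes for the surrogates.
    static final String CODE_POINT_SYNTAX = "[\\x{10000}-\\x{10ffff}\ud800-\udfff]";

    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String sample = "a\uD83D\uDE00b\uD83Dc"; // U+1F600 after 'a', lone surrogate after 'b'
        System.out.println(strip(LITERAL, sample));           // abc
        System.out.println(strip(REGEX_ESCAPES, sample));     // abc
        System.out.println(strip(CODE_POINT_SYNTAX, sample)); // abc
    }
}
```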
Reposted from: http://stackoverflow.com/questions/27820971/why-a-surrogate-java-regexp-finds-hypen-minus