Regular Expressions (Regex) in Java: Examples

Practical examples

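Find Chinese text: [^\x00-\xff] (strictly, this matches any character outside the Latin-1 range, which covers Chinese but also other non-Latin scripts and full-width punctuation)
Remove redundant blank lines so that exactly one blank line remains between paragraphs: replace \n{3,} with \n\n (or repeatedly replace \n\n\n with \n\n)
Line breaks for Markdown:
- Requirement: if two Chinese paragraphs have no blank line between them, insert one; English paragraphs are all code, so leave them alone
- Replace ([^\x00-\xff]\n)([^\x00-\xff]) with $1\n$2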
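
The two editor-style replacements above can be sketched in Java as follows. This is only an illustration: the class and method names (BlankLineDemo, collapseBlankLines, separateParagraphs) are made up for this example, and it assumes the text uses \n line endings.

```java
class BlankLineDemo {
    // Collapse any run of three or more newlines into exactly one blank line.
    static String collapseBlankLines(String text) {
        return text.replaceAll("\n{3,}", "\n\n");
    }

    // Insert a blank line between two adjacent lines when the first one ends with,
    // and the second one starts with, a character outside \x00-\xff.
    static String separateParagraphs(String text) {
        return text.replaceAll("([^\\x00-\\xff]\n)([^\\x00-\\xff])", "$1\n$2");
    }

    public static void main(String[] args) {
        System.out.println(collapseBlankLines("first\n\n\n\nsecond"));
        System.out.println(separateParagraphs("你好\n世界\ncode line\nmore code"));
    }
}
```

Because each match consumes the first character of the following line, a line that is only one character long can block the next match; for ordinary paragraphs this does not matter.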

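Backslashes in Java

In most other languages, \\ in a regular expression means: insert a plain (literal) backslash here, with no special meaning.
In a Java string literal that holds a regular expression, \\ means: insert one regex backslash, so the character that follows it has a special meaning.

The reason is that Java has no regex literal syntax: the pattern is written as an ordinary string, and the compiler collapses \\ to a single \ before the regex engine ever sees it. So where one backslash (\) is enough to escape in other languages, Java source needs two (\\). Put simply, two backslashes (\\) in Java source stand for one backslash (\) elsewhere: the regex for a single digit is written \\d, and the regex for a literal backslash is written \\\\.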

//In a string literal, 【\\】 denotes one plain backslash 【\】; in a regex written as a string literal, 【\\\\】 denotes one escaped, literal backslash 【\】
String string = "a\\b\\c";
System.out.println(string); //【a\b\c】
System.out.println(string.replace("\\", "_\\\\_")); //【a_\\_b_\\_c】
System.out.println(string.replaceAll("\\\\", "_\\\\\\\\_")); //【a_\\_b_\\_c】
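
When counting backslashes by hand gets error-prone, one option is to let the library do the escaping with Pattern.quote (for the pattern) and Matcher.quoteReplacement (for the replacement). The sketch below is illustrative only; BackslashDemo and its method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashDemo {
    // Replace every literal backslash with a forward slash.
    static String toForwardSlashes(String path) {
        return path.replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("/"));
    }

    // Replace every forward slash with a literal backslash.
    static String toBackslashes(String path) {
        return path.replaceAll(Pattern.quote("/"), Matcher.quoteReplacement("\\"));
    }

    public static void main(String[] args) {
        System.out.println(toForwardSlashes("a\\b\\c")); // a/b/c
        System.out.println(toBackslashes("a/b/c"));      // a\b\c
    }
}
```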

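Methods in the String class

Summary of commonly used methods:

contains: plain search (no regex)
matches: tests whether the whole string matches (regex)
split: splits the string (regex)
replace: replaces every occurrence of a substring (no regex)
replaceAll: replaces every matching substring (regex)
replaceFirst: replaces the first matching substring (regex)

contains: plain search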

//Returns true if and only if this string contains the specified sequence of char values.
public boolean contains(CharSequence s) {
    return indexOf(s.toString()) > -1;
}

String string = "abcd";
System.out.println(string.contains("ab") + ", " + string.contains("ab.*")); //true, false

matches: full regex match

Purpose: tests whether the current String matches the given regular expression in its entirety

Source

//String
//Tells whether or not this string matches the given regular expression
public boolean matches(String regex) {
    return Pattern.matches(regex, this);
}

//Pattern
//Compiles the given regular expression and attempts to match the given input against it.
public static boolean matches(String regex, CharSequence input) {
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    return m.matches();
}

//Matcher
//return true if, and only if, the entire region sequence matches this matcher's pattern
public boolean matches() {
    return match(from, ENDANCHOR);
}

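As the source shows, String#matches tests whether the whole String matches the given regular expression (not whether it contains a substring that matches). It is equivalent to calling Pattern#matches or Matcher#matches directly.

Examples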

String string = "abcd";
System.out.println(string.matches("bc") + ", " + string.matches("bc.*") + ", " + string.matches(".*bc.*")); //false, false, true

//Validate a mobile phone number
String string = "13512345678";
String regex = "1[358]\\d{9}";//【1[358]\d{9}】

boolean isMatch = string.matches(regex) && Pattern.matches(regex, string) && Pattern.compile(regex).matcher(string).matches();
System.out.println(isMatch + ", " + "12512345678".matches(regex));//true, false
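
To make the "whole string" versus "contains a match" distinction concrete, here is a small sketch that contrasts String#matches with Matcher#find. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    // True only if the entire input matches the regex.
    static boolean matchesWhole(String input, String regex) {
        return input.matches(regex);
    }

    // True if any substring of the input matches the regex.
    static boolean containsMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("abcd", "bc"));  // false
        System.out.println(containsMatch("abcd", "bc")); // true
    }
}
```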

split: regex-based splitting

It is equivalent to calling Pattern#split directly.

Source

//String
//return the array of strings computed by splitting this string around matches of the given regular expression
public String[] split(String regex) {
    return split(regex, 0);
}

public String[] split(String regex, int limit) {
    //...
    return Pattern.compile(regex).split(this, limit);
}

//Pattern
public String[] split(CharSequence input) {
    return split(input, 0);
}

public String[] split(CharSequence input, int limit) {
    //ArrayList<String> matchList = new ArrayList<>();
    //...
    return matchList.subList(0, resultSize).toArray(result);
}

Examples

String string = "000|成功|100";
String regex = "\\|";

String[] splitStrs = Pattern.compile(regex).split(string);
String[] splitStrs2 = string.split(regex);

System.out.println(Arrays.equals(splitStrs, splitStrs2));//true
System.out.println(Arrays.toString(splitStrs));//[000, 成功, 100]

String regex = "(.)\\1{2,}";//【(.)\1{2,}】note that the backreference \1 must be written \\1 in a Java string literal
String string = "zhanggangtttxiaoqiangmmmmmmzhaoliu";

String[] splitStrs = string.split(regex);
System.out.println(Arrays.toString(splitStrs));//[zhanggang, xiaoqiang, zhaoliu]

Exploring the limit parameter

How the limit parameter affects the result:

If the limit n is greater than zero then the pattern will be applied at most n-1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter.
If n is non-positive then the pattern will be applied as many times as possible and the array can have any length.
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

String string = "boo:and:foo";
String regex = "o";
for (int i = -1; i <= string.length(); i++) {
    String[] results = string.split(regex, i);
    String result = new Gson().toJson(results);
    System.out.println("| " + regex + " | " + i + " | " + results.length + " | " + result + " |");
}

Regex	Limit	Length	Result
:	-1	3	["boo","and","foo"]
:	0	3	["boo","and","foo"]
:	1	1	["boo:and:foo"]
:	2	2	["boo","and:foo"]
:	3	3	["boo","and","foo"]
:	4+	3	["boo","and","foo"]
o	-1	5	["b","",":and:f","",""]
o	0	3	["b","",":and:f"]
o	1	1	["boo:and:foo"]
o	2	2	["b","o:and:foo"]
o	3	3	["b","",":and:foo"]
o	4	4	["b","",":and:f","o"]
o	5	5	["b","",":and:f","",""]
o	6+	5	["b","",":and:f","",""]
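
The rows for the regex o can be reproduced without a JSON library. The sketch below uses Arrays.toString instead of Gson, so empty strings show up as nothing between commas; SplitLimitDemo and describe are hypothetical names.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Format one table row: the limit, the resulting length, and the pieces.
    static String describe(String input, String regex, int limit) {
        String[] parts = input.split(regex, limit);
        return limit + " -> " + parts.length + " " + Arrays.toString(parts);
    }

    public static void main(String[] args) {
        for (int limit : new int[] {-1, 0, 2}) {
            System.out.println(describe("boo:and:foo", "o", limit));
        }
    }
}
```

With limit -1 the two trailing empty strings are kept (length 5), with limit 0 they are dropped (length 3), and with limit 2 the pattern is applied only once.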

replace: plain replacement of all occurrences

Source

//Returns a string resulting from replacing all occurrences of oldChar in this string with newChar.
public String replace(char oldChar, char newChar) {}

//Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence. The replacement proceeds from the beginning of the string to the end.
public String replace(CharSequence target, CharSequence replacement) {
    return Pattern.compile(target.toString(), Pattern.LITERAL) //uses Pattern.LITERAL mode
        .matcher(this)
        .replaceAll(Matcher.quoteReplacement(replacement.toString()));
}

//Matcher
//The String produced will match the sequence of characters in s treated as a literal sequence. Slashes('\') and dollar-signs('$') will be given no special meaning.
public static String quoteReplacement(String s) {
    if ((s.indexOf('\\') == -1) && (s.indexOf('$') == -1))
        return s; //returned unchanged when it contains neither \ nor $
    StringBuilder sb = new StringBuilder();
    for (int i=0; i<s.length(); i++) {
        char c = s.charAt(i);
        if (c == '\\' || c == '$') {
            sb.append('\\'); //prefix every \ or $ with a \ so that it is treated as a literal character
        }
        sb.append(c);
    }
    return sb.toString();
}

This uses Pattern.LITERAL mode:

When this flag is specified then the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning.
The flags CASE_INSENSITIVE and UNICODE_CASE retain their impact on matching when used in conjunction with this flag. The other flags become superfluous.

Examples

System.out.println("aa-aaa".replace("aa", "b")); //b-ba

String str = "replace \\ and $ and \\$ and *";//【replace \ and $ and \$ and *】
System.out.println(str.replace("\\", "-")); //【replace - and $ and -$ and *】
System.out.println(str.replace("$", "-")); //【replace \ and - and \- and *】
System.out.println(str.replace("\\$", "-")); //【replace \ and $ and - and *】
System.out.println(str.replace("*", "$")); //【replace \ and $ and \$ and $】

replaceAll: regex replacement of all matches

Source

public String replaceAll(String regex, String replacement) {
    return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}

Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string.

Dollar signs may be treated as references to captured subsequences, and backslashes are used to escape literal characters in the replacement string.

Examples

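//Special characters 【\ and $】
String str = "CAUTION \\ and $"; //【CAUTION \ and $】
System.out.println(str.replaceAll("CAUTION", "CAU\\\\TION")); //【CAU\TION \ and $】
System.out.println(str.replaceAll("\\$", "\\\\")); //【CAUTION \ and \】
System.out.println(str.replaceAll("$$$$$$$$$", "\\\\")); //【CAUTION \ and $\】each unescaped $ is a zero-width end-of-input anchor, so the pattern matches the empty string at the very end
System.out.println(str.replaceAll("\\\\", "\\$")); //【CAUTION $ and $】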

System.out.println("aabfooaabfooabfoob".replaceAll("a*b", "-"));//-foo-foo-foo-

//Mask part of a phone number
String regex = "(\\d{3})(\\d{4})(\\d{4})"; //【(\d{3})(\d{4})(\d{4})】$1 refers to the contents of the first group
System.out.println("15800001111".replaceAll(regex, "$1$2****"));//1580000****
System.out.println("15800001111".replaceAll(regex, "$1****$3"));//158****1111

//Collapse stuttered repeats; the expected result 我要学编程 means "I want to learn programming"
String str = "我我.我我要...要要要要...要.学学学..学学编编...编编..编..程程...程程...";
System.out.println(str.replaceAll("\\.+", "").replaceAll("(.)\\1+", "$1")); //我要学编程

//Pad every segment of an IP address to the same width (useful for sorting later)
String ipString = "192.168.0.11  3.0.25.3";
String ipStringFillZeros = ipString.replaceAll("(\\d+)", "00$1");//first prepend two zeros
System.out.println(ipStringFillZeros);//00192.00168.000.0011  003.000.0025.003

String ipStringCertainLength = ipStringFillZeros.replaceAll("0*(\\d{3})", "$1"); //then keep only the last three digits of each segment
System.out.println(ipStringCertainLength);//192.168.000.011  003.000.025.003

String ipStringResult = ipStringCertainLength.replaceAll("0*(\\d+)", "$1"); //once processing is done, strip the extra zeros again
System.out.println(ipStringResult); //192.168.0.11  3.0.25.3
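
Because $ and \ are special in a replaceAll replacement, text that comes from elsewhere (user input, configuration) is usually passed through Matcher.quoteReplacement first. The following is a minimal sketch under that assumption; the {price} placeholder and the class name are invented for illustration.

```java
import java.util.regex.Matcher;

class ReplacementQuotingDemo {
    // Substitute the {price} placeholder with arbitrary text that may contain $ or \.
    static String fill(String template, String value) {
        return template.replaceAll("\\{price\\}", Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        System.out.println(fill("Total: {price}", "$5")); // Total: $5
    }
}
```

Without the quoting step, the replacement "$5" would be read as a reference to capture group 5, which this pattern does not have, and the call would fail.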

replaceFirst: regex replacement of the first match

//Replaces the first substring of this string that matches the given regular expression with the given replacement.
public String replaceFirst(String regex, String replacement) {
    return Pattern.compile(regex).matcher(this).replaceFirst(replacement);
}
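
A short sketch of how replaceFirst differs from replaceAll on the same input; the class and method names are hypothetical.

```java
class ReplaceFirstDemo {
    // Replace only the first match of a*b.
    static String firstOnly(String input) {
        return input.replaceFirst("a*b", "-");
    }

    // Replace every match of a*b.
    static String all(String input) {
        return input.replaceAll("a*b", "-");
    }

    public static void main(String[] args) {
        System.out.println(firstOnly("aabfooaabfoo")); // -fooaabfoo
        System.out.println(all("aabfooaabfoo"));       // -foo-foo
    }
}
```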

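Overview of the java.util.regex package

The java.util.regex package mainly consists of the following three classes:

Pattern: a compiled representation of a regular expression. Pattern has no public constructor; obtain a Pattern object by calling one of its public static compile methods.
Matcher: the engine that interprets the pattern and performs match operations on an input string. Matcher has no public constructor either; obtain a Matcher object by calling the matcher method of a Pattern object.
PatternSyntaxException: an unchecked exception that indicates a syntax error in a regular expression pattern.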
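
Since PatternSyntaxException is unchecked, the compiler does not force a try/catch; code that compiles patterns from untrusted input may still want one. A minimal sketch, with a hypothetical helper isValidRegex:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternSyntaxDemo {
    // Returns true if the given text compiles as a regular expression.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() and getPattern() describe what went wrong and where.
            System.out.println(e.getDescription() + " in " + e.getPattern());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("a*b"));  // true
        System.out.println(isValidRegex("(abc")); // false: the group is never closed
    }
}
```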

Example 1: matching strings

String regex = "\\b[a-z]{3}\\b";//【\b[a-z]{3}\b】matches words made of exactly three letters
String str = "da jia hao, ming tian bu fang jia!";

Matcher m = Pattern.compile(regex).matcher(str);
while (m.find()) {
    System.out.println(m.group());
}

jia
hao
jia

Example 2: groups

String regex = "([a-z]+)(\\d+)"; //【([a-z]+)(\d+)】matches every run of letters followed by digits
String line = "bqt20094哈哈abc789";

Matcher m = Pattern.compile(regex).matcher(line);
System.out.println("【" + regex + "】 groupCount = " + m.groupCount());

while (m.find()) {
    System.out.println("Matched: " + m.group() + ", span: [" + m.start() + "," + m.end()+"]");
    for (int i = 0; i <= m.groupCount(); i++) {
        System.out.println("\tgroup " + i + " : " + m.group(i));
    }
}

【([a-z]+)(\d+)】 groupCount = 2
Matched: bqt20094, span: [0,8]
    group 0 : bqt20094
    group 1 : bqt
    group 2 : 20094
Matched: abc789, span: [10,16]
    group 0 : abc789
    group 1 : abc
    group 2 : 789

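Pattern

Static methods

Pattern compile(String regex)  //compiles the given regular expression into a pattern
Pattern compile(String regex, int flags)  //compiles the given regular expression into a pattern with the given flags
boolean matches(String regex, CharSequence input)  //compiles the given regular expression and attempts to match the given input against it
String quote(String s)  //returns a literal pattern String for the specified String

Instance methods

Predicate<String> asPredicate()  //creates a predicate that tests whether this pattern is found in a given input string
int flags()  //returns this pattern's match flags
Matcher matcher(CharSequence input)  //creates a matcher that will match the given input against this pattern
String pattern()  //returns the regular expression from which this pattern was compiled
String[] split(CharSequence input)  //splits the given input sequence around matches of this pattern
String[] split(CharSequence input, int limit)  //same, with limit controlling how many times the pattern is applied
Stream<String> splitAsStream(final CharSequence input)  //creates a stream from the given input sequence around matches of this pattern
String toString()  //returns the string representation of this pattern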
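
A few of the Pattern methods listed above, shown together in a small sketch. The class name, method names, and sample inputs are invented for illustration; asPredicate and splitAsStream require Java 8 or later.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PatternApiDemo {
    // Split on commas, ignoring whitespace around them, via splitAsStream.
    static List<String> splitOnCommas(String csv) {
        return Pattern.compile("\\s*,\\s*").splitAsStream(csv).collect(Collectors.toList());
    }

    // Case-insensitive "contains" test built with compile(regex, flags) and asPredicate.
    static boolean mentionsJava(String text) {
        Predicate<String> containsJava = Pattern.compile("java", Pattern.CASE_INSENSITIVE).asPredicate();
        return containsJava.test(text);
    }

    public static void main(String[] args) {
        System.out.println(splitOnCommas("a , b,c"));      // [a, b, c]
        System.out.println(mentionsJava("I like JAVA 8")); // true
    }
}
```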

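Matcher

API

Static methods

String quoteReplacement(String s)  //returns a literal replacement String for the specified String

boolean

boolean find()  //attempts to find the next subsequence of the input sequence that matches the pattern
boolean find(int start)  //resets this matcher, then attempts to find the next matching subsequence starting at the specified index
boolean hasAnchoringBounds()  //queries the anchoring of region bounds for this matcher
boolean hasTransparentBounds()  //queries the transparency of region bounds for this matcher
boolean hitEnd()  //returns true if the search engine hit the end of input during the last match operation
boolean lookingAt()  //attempts to match the input sequence, starting at the beginning of the region, against the pattern
boolean matches()  //attempts to match the entire region against the pattern
boolean requireEnd()  //returns true if more input could change a positive match into a negative one

int

int start()  //returns the start index of the previous match
int start(int group)  //returns the start index of the subsequence captured by the given group during the previous match operation
int start(String name)  //same, for the given named capturing group
int end()  //returns the offset after the last character matched
int end(int group)  //returns the offset after the last character of the subsequence captured by the given group during the previous match operation
int end(String name)  //same, for the given named capturing group
int groupCount()   //returns the number of capturing groups in this matcher's pattern
int regionStart()  //reports the start index of this matcher's region
int regionEnd()  //reports the end index (exclusive) of this matcher's region

String

StringBuffer appendTail(StringBuffer sb)  //implements a terminal append-and-replace step
String group()  //returns the input subsequence matched by the previous match
String group(int group)  //returns the input subsequence captured by the given group during the previous match operation
String group(String name)  //same, for the given named capturing group
String replaceAll(String replacement)  //replaces every subsequence of the input sequence that matches the pattern with the given replacement string
String replaceFirst(String replacement)  //replaces the first subsequence of the input sequence that matches the pattern with the given replacement string
String toString()  //returns the string representation of this matcher

Matcher

Matcher appendReplacement(StringBuffer sb, String replacement)  //implements a non-terminal append-and-replace step
Matcher region(int start, int end)  //sets the limits of this matcher's region
Matcher reset()  //resets this matcher
Matcher reset(CharSequence input)  //resets this matcher with a new input sequence
Matcher useAnchoringBounds(boolean b)  //sets the anchoring of region bounds for this matcher
Matcher usePattern(Pattern newPattern)  //changes the Pattern that this Matcher uses to find matches
Matcher useTransparentBounds(boolean b)  //sets the transparency of region bounds for this matcher

Other

Pattern pattern()  //returns the pattern that is interpreted by this matcher
MatchResult toMatchResult()  //returns the match state of this matcher as a MatchResult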
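
The name-based overloads (group(String), start(String), end(String)) work with named capturing groups written as (?<name>...). The sketch below assumes Java 8 or later; the pattern, group names, and class name are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    private static final Pattern DATE = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

    // Report the first year-month pair and the span it covers in the input.
    static String describe(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group("year") + "/" + m.group("month")
                + " at [" + m.start("year") + "," + m.end("month") + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("released 2020-03")); // 2020/03 at [9,16)
    }
}
```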

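The matches and lookingAt methods

Both matches and lookingAt attempt to match an input sequence against a pattern. The difference is that matches requires the entire sequence to match, whereas lookingAt only requires a match that starts at the first character.

Both methods always start at the beginning of the input string.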

String regex = "foo";
String input = "fooo";
String input2 = "ofoo";
Matcher matcher = Pattern.compile(regex).matcher(input);
Matcher matcher2 = Pattern.compile(regex).matcher(input2);

//lookingAt(): matches against the start of the input; returns true as long as a prefix of the input matches
System.out.println(matcher.matches() + ", " + matcher.lookingAt()); //false, true
System.out.println(matcher2.matches() + ", " + matcher2.lookingAt()); //false, false

Preconditions for calling start, end, and group

Javadoc of find():

Attempts to find the next subsequence of the input sequence that matches the pattern.
This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.
If the match succeeds then more information can be obtained via the start, end, and group methods.
Returns: true if, and only if, a subsequence of the input sequence matches this matcher's pattern

Javadoc of matches():

Attempts to match the entire region against the pattern.
If the match succeeds then more information can be obtained via the start, end, and group methods.

Javadoc of lookingAt():

Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
Like the matches method, this method always starts at the beginning of the region; unlike that method, it does not require that the entire region be matched.
If the match succeeds then more information can be obtained via the start, end, and group methods.

So before calling start, end, or group, make sure that one of find, matches, or lookingAt has been called and has returned true; otherwise an IllegalStateException (for example "No match available") is thrown!

String regex = "foo";
String input = "fooo";
Matcher matcher = Pattern.compile(regex).matcher(input);

System.out.println(matcher.find() + ", " + matcher.find() + ", " + matcher.find()); //true, false, false
System.out.println(matcher.matches() + ", " + matcher.lookingAt()); //false, true
System.out.println(matcher.find() + ", " + matcher.find() + ", " + matcher.find()); //false, false, false

if (matcher.lookingAt()) {
    System.out.println(matcher.group() + ": " + matcher.start() + "-" + matcher.end());//foo: 0-3
}
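
The failure mode described above is easy to reproduce. The sketch below calls group() on a fresh matcher and then shows a guarded alternative; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchStateDemo {
    // Returns true if group() throws because no match operation has succeeded yet.
    static boolean groupThrowsBeforeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Guarded access: only read group() after find() has returned true.
    static String firstMatchOrNull(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(groupThrowsBeforeMatch("foo", "fooo")); // true
        System.out.println(firstMatchOrNull("foo", "fooo"));       // foo
        System.out.println(firstMatchOrNull("bar", "fooo"));       // null
    }
}
```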

The append* methods

Matcher also provides appendReplacement and appendTail for text replacement; used one after the other, they collect the result into an existing string buffer.

String regex = "a*b";
String input = "aabfooabfoobkkk";
String replace = "-";
Matcher m = Pattern.compile(regex).matcher(input);

StringBuffer sb = new StringBuffer();
while (m.find()) {
    m.appendReplacement(sb, replace); //non-terminal append-and-replace step
    System.out.println(m.group() + ", " + sb.toString()); //[aab, ab, b]
}
m.appendTail(sb); //terminal append-and-replace step
System.out.println(sb.toString()); //-foo-foo-kkk

aab, -
ab, -foo-
b, -foo-foo-
-foo-foo-kkk
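
The main reason to use appendReplacement and appendTail instead of replaceAll is that the replacement can be computed per match. The sketch below doubles every number in the input; it is an illustration with invented names and uses StringBuffer so that it also works on Java 8.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Replace each run of digits with twice its numeric value.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // quoteReplacement keeps the computed text literal in the replacement.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(doubled)));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("a1b22c")); // a2b44c
    }
}
```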

2020-03-31

Posted 2020-03-31 23:50 by 白乾涛
