FuzzyScore

Fuzzy score

implementation 'org.apache.commons:commons-text:1.10.0'

A matching algorithm that is similar to the searching algorithms implemented in editors such as Sublime Text, TextMate, Atom and others.
One point is given for every matched character. Subsequent matches yield two bonus points. A higher score indicates a higher similarity.
This code has been adapted from Apache Commons Lang 3.3.
Since:
1.0

FuzzyScore is a score-based matching algorithm: the higher the score, the more similar the two strings.

Official Javadoc:

Find the Fuzzy Score which indicates the similarity score between two Strings.
score.fuzzyScore(null, null) = IllegalArgumentException
score.fuzzyScore("not null", null) = IllegalArgumentException
score.fuzzyScore(null, "not null") = IllegalArgumentException
score.fuzzyScore("", "") = 0
score.fuzzyScore("Workshop", "b") = 0
score.fuzzyScore("Room", "o") = 1
score.fuzzyScore("Workshop", "w") = 1
score.fuzzyScore("Workshop", "ws") = 2
score.fuzzyScore("Workshop", "wo") = 4
score.fuzzyScore("Apache Software Foundation", "asf") = 3
Params:
term – a full term that should be matched against, must not be null
query – the query that will be matched against a term, must not be null
Returns: result score
Throws: IllegalArgumentException – if the term or query is null

Take a concrete example.
Given two strings:
a = "Workshop"

b = "wo"

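Example program

The implementation is simple: a single method, fuzzyScore, does all the work.
The method treats term as the text being searched and walks through it with the characters of query, so previousMatchingCharacterIndex records an index into term, not into query.

Explanation: the query character w matches the first character of Workshop, so score = 1. The next query character o matches at index 1, directly after the matched w, so it earns 1 + 2 = 3. The query is now exhausted, so the second o in shop is never examined. The total is 1 + 3 = 4.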
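The walkthrough can be checked with a small standalone sketch. It is not the library code: FuzzyScoreTrace and pointsPerQueryChar are hypothetical names, and Locale.ENGLISH is assumed in place of the locale that FuzzyScore receives. It mirrors the same scanning loop but records the points earned by each query character instead of only the total.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of Commons Text: mirrors the scoring loop
// but keeps the points earned by each query character.
class FuzzyScoreTrace {

    static int[] pointsPerQueryChar(String term, String query) {
        // Assumption: Locale.ENGLISH stands in for the FuzzyScore locale.
        String t = term.toLowerCase(Locale.ENGLISH);
        String q = query.toLowerCase(Locale.ENGLISH);

        int[] points = new int[q.length()];
        int termIndex = 0;                      // next position to scan in the term
        int previousMatch = Integer.MIN_VALUE;  // term index of the last match

        for (int qi = 0; qi < q.length(); qi++) {
            boolean found = false;
            for (; termIndex < t.length() && !found; termIndex++) {
                if (q.charAt(qi) == t.charAt(termIndex)) {
                    points[qi] = 1;             // plain match
                    if (previousMatch + 1 == termIndex) {
                        points[qi] += 2;        // bonus: adjacent to the previous match
                    }
                    previousMatch = termIndex;
                    found = true;
                }
            }
        }
        return points;
    }

    public static void main(String[] args) {
        int[] points = pointsPerQueryChar("Workshop", "wo");
        // prints "[1, 3] total=4"
        System.out.println(Arrays.toString(points) + " total=" + Arrays.stream(points).sum());
    }
}
```

For "Workshop" and "ws" the same helper returns [1, 1], because the matched s is not adjacent to the matched w, which agrees with the score of 2 in the Javadoc table.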

public Integer fuzzyScore(final CharSequence term, final CharSequence query) {
    if (term == null || query == null) {
        throw new IllegalArgumentException("CharSequences must not be null");
    }

    // fuzzy logic is case insensitive. We normalize the Strings to lower
    // case right from the start. Turning characters to lower case
    // via Character.toLowerCase(char) is unfortunately insufficient
    // as it does not accept a locale.
    final String termLowerCase = term.toString().toLowerCase(locale);
    final String queryLowerCase = query.toString().toLowerCase(locale);

    // the resulting score
    int score = 0;

    // the position in the term which will be scanned next for potential
    // query character matches
    int termIndex = 0;

    // index of the previously matched character in the term
    int previousMatchingCharacterIndex = Integer.MIN_VALUE;
    // iterate over the characters of the query
    for (int queryIndex = 0; queryIndex < queryLowerCase.length(); queryIndex++) {
        final char queryChar = queryLowerCase.charAt(queryIndex);

        boolean termCharacterMatchFound = false;
        for (; termIndex < termLowerCase.length()
                && !termCharacterMatchFound; termIndex++) {
            final char termChar = termLowerCase.charAt(termIndex);
            // found the same character in the term
            if (queryChar == termChar) {
                // simple character matches result in one point
                // a matching character adds 1 to the score
                score++;

                // subsequent character matches further improve
                // the score.
                // if the previous term character was also matched, add 2
                // bonus points (3 points in total for this match)
                if (previousMatchingCharacterIndex + 1 == termIndex) {
                    score += 2;
                }

                // remember the term index of this match for the next iteration
                previousMatchingCharacterIndex = termIndex;

                // we can leave the nested loop. Every character in the
                // query can match at most one character in the term.
                termCharacterMatchFound = true;
            }
        }
    }

    return score;
}
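The comment in the method about lower-casing with a locale is easier to see with a concrete case. The sketch below uses only the JDK; LocaleLowerCaseDemo and lower are hypothetical names chosen for illustration. It shows that String.toLowerCase(Locale) can give a different result per locale, while Character.toLowerCase(char) has no locale parameter at all.

```java
import java.util.Locale;

// Hypothetical demo class: compares locale-aware and locale-free lower-casing.
class LocaleLowerCaseDemo {

    // Lower-cases s using the locale identified by a BCP 47 language tag.
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        // English rules: 'I' becomes the ordinary dotted 'i'.
        System.out.println(lower("TITLE", "en"));

        // Turkish rules: 'I' becomes the dotless i (U+0131).
        System.out.println(lower("TITLE", "tr"));

        // The char-based method takes no locale, so it always yields 'i'.
        System.out.println(Character.toLowerCase('I'));
    }
}
```

Because fuzzyScore compares the lower-cased characters one by one, the locale passed to the FuzzyScore constructor can change which characters count as equal, so it is worth choosing deliberately.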
posted @ 2023-05-06 16:59  干翻苍穹