构造与使用分离:命中内容高亮及合并的展示问题解决实现

在入侵检测业务中,针对文件内容型的告警详情,需要匹配命中内容的上下若干行。

第一种实现

先来看一种实现:


/**
 * Created by zhangli on 19-12-18.
 * 高亮文本工具类
 */
public class HighLightUtils {
    private static final Integer LINE_NUM = 10;
    private static final int MAX_REGEX_NUM = 10;
 
    /**
     * @param content 文本内容
     * @param keywords 关键字列表
     * @return 高亮内容段落集
     */
    public static List<MatchedContent> highlight(String content, List<String> keywords) {
        if (StringUtils.isEmpty(content) || CollectionUtils.isEmpty(keywords)) {
            return Collections.emptyList();
        }
 
        List<MatchedContent> partContentList = Lists.newArrayList();
        for (String keyword : keywords) {
            if (!content.contains(keyword)) {
                continue;
            }
            partContentList.addAll(highlight(content, escapeRegexSpecialWord(keyword)));
        }
        return partContentList;
    }
 
    /**
     * @param content 文本内容
     * @param regex 正则表达式
     * @return 高亮内容段落集
     */
    public static List<MatchedContent> highlight(String content, String regex) {
        return highlight(content, regex, MAX_REGEX_NUM, LINE_NUM);
    }
 
    public static List<MatchedContent> highlight(String content, String regex, int maxMatchNum, int lineNum) {
        if (StringUtils.isEmpty(content) || StringUtils.isEmpty(regex)) {
            return Collections.emptyList();
        }
 
        content = content.replaceAll("\\r\\n", "\n");
 
        Pattern pattern = Pattern.compile(regex);
        Matcher m = pattern.matcher(content);
 
        List<MatchedContent> partContentList = Lists.newArrayList();
        int maxNum = maxMatchNum;
        while (m.find()) {
            RegexMatchPoint regexMatchPoint = new RegexMatchPoint(m.start(), m.end());
            partContentList.add(getPartContentMap(content, regexMatchPoint, lineNum));
            if (--maxNum == 0) {
                break;
            }
        }
        return partContentList;
    }
 
    /**
     * 根据正则匹配获取高亮内容及起始行
     */
    private static MatchedContent getPartContentMap(String content, RegexMatchPoint m, int lineNum) {
        // 获取匹配内容在文件中的行数
        int startMatchLine = content.substring(0, m.getStart()).split("\\n").length;
        int endMatchLine = content.substring(0, m.getEnd()).split("\\n").length;
        // 高亮文件匹配内容
        String highlightContent = highlightOneRegexContent(content, m);
        // 截取匹配内容前后共20行(若匹配内容跨行,且大于10行,则从匹配的地方开始截取)
        String partContent = getPartContent(highlightContent, startMatchLine, endMatchLine);
        // 获取截取内容首行的行号
        int startLine = endMatchLine - lineNum + 1;
        //如果匹配的内容大于10行,则从最初匹配行开始,而不是固定的10行
        if (startMatchLine < startLine) {
            startLine = startMatchLine;
        }
 
        return MatchedContent.builder()
                .startLine(startLine < 1 ? 1 : startLine)
                .partContent(partContent)
                .build();
    }
 
    /**
     * 获取高亮行前后部分内容
     */
    private static String getPartContent(String content, Integer startMatchLine, Integer endMatchLine) {
        int start = StringUtils.ordinalIndexOf(content, "\n", endMatchLine - LINE_NUM);
        if (endMatchLine - startMatchLine > LINE_NUM) {
            start = StringUtils.ordinalIndexOf(content, "\n", startMatchLine - 1);
        }
        start = start < 0 ? 0 : start + 1;
        int end = StringUtils.ordinalIndexOf(content, "\n", endMatchLine + LINE_NUM);
        end = end < 0 ? content.length() : end;
 
        return content.substring(start, end);
    }
 
    /**
     * 高亮单个匹配的内容
     */
    private static String highlightOneRegexContent(String content, RegexMatchPoint point) {
        int start = 0;
 
        StringBuffer highlightContentSb = new StringBuffer();
        highlightContentSb.append(content.substring(start, point.getStart())).append(CommonValues.HIGH_LIGHT_START)
                .append(content.substring(point.getStart(), point.getEnd())).append(CommonValues.HIGH_LIGHT_END)
                .append(content.substring(point.getEnd()));
 
        return highlightContentSb.toString();
    }
 
    private static String escapeRegexSpecialWord(String keyword) {
        if (keyword != "") {
            String[] fbsArr = { "\\", "$", "(", ")", "*", "+", ".", "[", "]", "?", "^", "{", "}", "|" };
            for (String key : fbsArr) {
                if (keyword.contains(key)) {
                    keyword = keyword.replace(key, "\\" + key);
                }
            }
        }
        return keyword;
    }
 
    @Setter
    @Getter
    @ToString
    public static class RegexMatchPoint implements Comparable<RegexMatchPoint> {
        private Integer start;
        private Integer end;
 
        public RegexMatchPoint(Integer start, Integer end) {
            this.start = start;
            this.end = end;
        }
 
        //按开始位置排序
        @Override
        public int compareTo(RegexMatchPoint o) {
            if (start.compareTo(o.getStart()) == 0) {
                return end.compareTo(o.getEnd());
            } else {
                return start.compareTo(o.getStart());
            }
        }
 
        public RegexMatchPoint copy() {
            return new RegexMatchPoint(start, end);
        }
 
    }
}

这个实现还是不错的,至少给人很好的启发,是一个很好的改进基础。

那么,它的问题在哪里呢?

  • 按单个匹配高亮,如果是行内有多个匹配的话,就很难合并;
  • 缺乏行号的记录,所有跟行有关的东西,都是通过 split("\n") 和 substring(start, end)来实现的;
  • 合并命中内容比较困难。

构造与使用分离

何为构造与使用分离呢? 是指构造的时候,就提取出足够的必要信息; 而在使用时则运用这些信息去处理,而不是“边构建边使用”。就像编译器做代码编译和自动生成一样,应该不会是边编译边生成代码。

边构建边使用的实现,会将构建与处理耦合在一起,一旦有需要改动,就会比较困难。

很显然,如果要构造与使用分离,那么我们需要首先拿到什么内容? (命中内容的行号、起始位置、结束位置;所有文件行及行号) 应该先把这些必要信息提取出来。一旦我们确定了求解问题需要的必要信息,想出一个清晰的算法就比较自然了。

高亮内容合并展示的算法简述

步骤一:获取所有行及行号【行号,行内容】;

步骤二: 先找到所有匹配正则的字符串的行号及起始结束点 regexMatchPoints =(lineNo, start, end);

步骤三:将 regexMatchPoints 按行分组;因为行内的多个匹配合并很麻烦;

步骤四:所有匹配行号,按匹配行号排序,方便最终按行号序展示;

步骤五:按行生成高亮内容展示[行号,高亮行内容];

步骤六:按匹配行号计算起始行和结束行行号,如果已经在这个区间的行号则可以过滤(合并实现);

步骤七:根据所有起始行号和结束行号获取对应的行内容。


实现代码


/**
 * 高亮文本展示工具类
 * Created by qinshu on 2021/12/31
 */
public class HighLightUtil {
 
    private static final Logger LOG = LogUtils.getLogger(HighLightUtil.class);
 
    /** 高亮展示前后的行数 */
    private static final Integer HIGHLIGHT_LINE_NUM = 5;
 
    /** 最大匹配多少次 */
    private static final int MAX_REGEX_NUM = 10;
 
    /**
     * @param content 文本内容
     * @param regex 正则表达式
     * @return 高亮内容段落集
     */
    public static List<MatchedFileContent> highlight(String content, String regex) {
        return highlight(content, regex, MAX_REGEX_NUM, HIGHLIGHT_LINE_NUM);
    }
 
    /**
     * @param base64Content 文本内容(base64编码后的文本)
     * @param regex 正则表达式
     * @return 高亮内容段落集
     */
    public static List<MatchedFileContent> highlightBase64(String base64Content, String regex) {
        if (StringUtils.isEmpty(base64Content)) {
            return Collections.emptyList();
        }
        return highlight(Base64Utils.decodeContent(base64Content), regex);
    }
 
    public static List<MatchedFileContent> highlight(String content, String regex, int maxMatchNum, int highlightLineNum) {
        if (StringUtils.isEmpty(content) || StringUtils.isEmpty(regex)) {
            return Collections.emptyList();
        }
 
        content = content.replaceAll("\\r\\n", "\n");
 
        List<String> allLines = Arrays.asList(content.split("\n"));
        Pattern pattern = Pattern.compile(regex);
        List<RegexMatchPoint> regexMatchPoints = findAllRegexMatches(allLines, pattern);
 
        // 按行号分组,匹配高亮展示,因为单行多个匹配的高亮需要单行展示,分开后合并比较麻烦
        Map<Integer, List<RegexMatchPoint>> regexMatchPointMap = regexMatchPoints.stream().collect(Collectors.groupingBy(RegexMatchPoint::getLineNo));
 
        // highLightLineMap: [行号,高亮行]
        Map<Integer, String> highLightLineMap = new HashMap<>();
        regexMatchPointMap.forEach((lineNo, matchPointsOfLine) -> {
            highLightLineMap.put(lineNo, highlightOneLineContent(allLines.get(lineNo), matchPointsOfLine));
        } );
 
        List<MatchedFileContent> partContentList = merge(highLightLineMap, allLines, highlightLineNum);
        return partContentList.subList(0, Math.min(partContentList.size(), maxMatchNum));
    }
 
    private static List<MatchedFileContent> merge(Map<Integer, String> highLightLineMap, List<String> allLines, int highlightLineNum) {
 
        // 按行号排序
        List<Integer> highLightLineNos = Lists.newArrayList(highLightLineMap.keySet());
        Collections.sort(highLightLineNos);
 
        // 计算需要展示的行号
        List<MatchedFileLine> matchedFileLines = Lists.newArrayList();
        for (Integer highLineNo: highLightLineNos) {
            if (!exist(matchedFileLines, highLineNo)) {
                int startLine = highLineNo - highlightLineNum;
                int endLine = 0;
                if (startLine < 0) {
                    startLine = 0;
                    endLine = highLineNo + highlightLineNum;
                }
                else {
                    startLine = highLineNo - highlightLineNum + 1;
                    endLine = highLineNo + highlightLineNum;
                }
                matchedFileLines.add(new MatchedFileLine(startLine, endLine));
            }
        }
 
        return matchedFileLines.stream()
                .map(fileLine -> getMatchedFileContent(highLightLineMap, allLines, fileLine)).collect(Collectors.toList());
    }
 
    /**
     * 获取指定行号的行内容
     */
    private static String getLine(Map<Integer, String> highLightLineMap, List<String> allLines, Integer lineNo) {
        String highLightLine =  highLightLineMap.get(lineNo);
        return highLightLine != null ? highLightLine : allLines.get(lineNo);
    }
 
    private static boolean exist(List<MatchedFileLine> matchedFileLines, Integer lineNo) {
        return matchedFileLines.stream().anyMatch(fileLine -> exist(fileLine, lineNo));
    }
 
    private static boolean exist(MatchedFileLine matchedFileLine, Integer lineNo) {
        return lineNo >= matchedFileLine.getStartLine() && lineNo < matchedFileLine.getEndLine();
    }
 
    /**
     * 根据起始行号获取
     * @param highLightLineMap 高亮行
     * @param allLines 文件所有行
     * @param fileLine 匹配内容上下文行号
     * @return 匹配内容上下文及起始行号
     */
    private static MatchedFileContent getMatchedFileContent(Map<Integer, String> highLightLineMap, List<String> allLines, MatchedFileLine fileLine) {
        StringBuilder partContentBuilder = new StringBuilder();
        for (int i = fileLine.getStartLine(); i < fileLine.getEndLine() && i < allLines.size(); i++) {
            partContentBuilder.append(getLine(highLightLineMap, allLines, i) + "\n");
        }
        return new MatchedFileContent(fileLine.getStartLine() + 1, partContentBuilder.toString());
    }
 
    /**
     * 获取所有正则匹配点
     * @param allLines 文件内容的所有行
     * @param pattern 正则匹配编译表达式
     * @return 所有匹配正则表达式的字符串的位置
     */
    private static List<RegexMatchPoint> findAllRegexMatches(List<String> allLines, Pattern pattern) {
        // 先拿到所有的正则匹配点,行号从 0 开始
        List<RegexMatchPoint> regexMatchPoints = Lists.newArrayList();
        for (int i=0; i < allLines.size(); i++) {
            String line = allLines.get(i);
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                RegexMatchPoint regexMatchPoint = new RegexMatchPoint(i, m.start(), m.end());
                regexMatchPoints.add(regexMatchPoint);
            }
        }
        return regexMatchPoints;
    }
 
 
    /**
     * 高亮文本内容
     */
    public static String highlightContent(String content, List<String> match) {
        if (CollectionUtils.isEmpty(match)) {
            return content;
        }
 
        try {
            for (String matchContent : match) {
                String highlightContent = String.format("%s%s%s", CommonValues.HIGH_LIGHT_START, matchContent, CommonValues.HIGH_LIGHT_END);
                content = content.replaceAll(ExprUtils.escapeExprSpecialWord(matchContent), highlightContent);
            }
        } catch (Exception e) {
            LOG.error("highlight content error, content:{}, match:{}", content, match);
        }
        return content;
    }
 
    /**
     * 高亮一行的展示
     */
    public static String highlightOneLineContent(String content, List<RegexMatchPoint> points) {
        int start = 0;
        int lastMatchEnd = 0;
        StringBuilder sb = new StringBuilder();
        for (RegexMatchPoint point: points) {
            sb.append(content, start, point.getStart()).append(CommonValues.HIGH_LIGHT_START)
                    .append(content, point.getStart(), point.getEnd()).append(CommonValues.HIGH_LIGHT_END);
            start = point.getEnd();
            lastMatchEnd = point.getEnd();
        }
        sb.append(content.substring(lastMatchEnd));
        return sb.toString();
    }
 
    @Setter
    @Getter
    @ToString
    public static class RegexMatchPoint implements Comparable<RegexMatchPoint> {
 
        private Integer lineNo;
        private Integer start;
        private Integer end;
 
        public RegexMatchPoint(Integer lineNo, Integer start, Integer end) {
            this.lineNo = lineNo;
            this.start = start;
            this.end = end;
        }
 
        public RegexMatchPoint copy() {
            return new RegexMatchPoint(lineNo, start, end);
        }
    }
 
    @Setter
    @Getter
    public static class MatchedFileLine {
        private Integer startLine;
        private Integer endLine;
        public MatchedFileLine(Integer startLine, Integer endLine) {
            this.startLine = startLine;
            this.endLine = endLine;
        }
    }
}


自测


/**
 * 高亮展示
 * Created by qinshu on 2021/12/31
 */
public class HighlightUtilTest {
 
    String content = "dependencies {\n" +
            "    testCompile group: 'junit', name: 'junit'\n" +
            "\n" +
            "    compile project(\":detect-lib\")\n" +
            "    compile project(\":connect-cli\")\n" +
            "    compile project(\":wisteria-client\")\n" +
            "    compile project(\":upload-cli\")\n" +
            "    compile project(\":scan-client\")\n" +
            "    compile(\"com.qt.qt-common:config-loader\")\n" +
            "    compile project(\":switches-lib\")\n" +
            "    compile project(\":bizevent-lib\")\n" +
            "    compile project(\":user-client\")\n" +
            "    compile project(\":notif-client\")\n" +
            "    compile project(\":detect-client\")\n" +
            "    compile project(\":job-cli\")\n" +
            "    compile('com.qt.qt-common:redis-lib')\n" +
            "    compile('com.qt.qt-common:rabbitmq-lib')\n" +
            "    compile('com.qt.qt-common:encrypt-property-lib')\n" +
            "    compile project(\":leader-latch-lib\")\n" +
            "    compile(\"com.qt.qt-common:eventflow-lib:1.0.0-SNAPSHOT\")\n" +
            "    compile(\"com.qt.qt-common:intrusion-detect-lib:1.0.1\")\n" +
            "    compile('com.qt.qt-common:mysql-lib')\n" +
            "    compile('com.qt.qt-common:rule-crypto')\n" +
            "    compile project(\":rule-lib\")\n" +
            "    compile project(\":api-auth-lib\")\n" +
            "\n" +
            "    // Spring Cloud\n" +
            "    //  配置中心\n" +
            "    compile ('org.springframework.cloud:spring-cloud-starter-zookeeper-config')\n" +
            "    //  服务发现\n" +
            "    compile ('org.springframework.cloud:spring-cloud-starter-zookeeper-discovery')\n" +
            "    compile ('com.netflix.hystrix:hystrix-javanica')\n" +
            "\n" +
            "    // Spring Boot\n" +
            "    compile('org.springframework.boot:spring-boot-starter-web')\n" +
            "    compile('org.springframework.boot:spring-boot-starter-aop')\n" +
            "    compile('org.springframework.boot:spring-boot-starter-data-redis')\n" +
            "\n" +
            "    // Spring\n" +
            "    compile('org.springframework:spring-orm')\n" +
            "    compile('org.springframework:spring-jdbc')\n" +
            "    compile('org.springframework:spring-aop')\n" +
            "\n" +
            "    // mongodb\n" +
            "    compile('org.springframework.data:spring-data-mongodb:1.10.23.RELEASE')\n" +
            "\n" +
            "    // Mysql\n" +
            "    runtime('mysql:mysql-connector-java')\n" +
            "    compile('com.zaxxer:HikariCP')\n" +
            "    compile('org.mybatis.spring.boot:mybatis-spring-boot-starter')\n" +
            "    compile('com.github.pagehelper:pagehelper-spring-boot-starter')\n" +
            "\n" +
            "    //redisson\n" +
            "    compile('io.projectreactor:reactor-core:3.2.8.RELEASE')\n" +
            "\n" +
            "    // Jackson\n" +
            "    compile('com.fasterxml.jackson.core:jackson-core')\n" +
            "    compile('com.fasterxml.jackson.core:jackson-annotations')\n" +
            "    compile('com.fasterxml.jackson.core:jackson-databind')\n" +
            "    compile('org.codehaus.jackson:jackson-core-asl')\n" +
            "\n" +
            "    compile('joda-time:joda-time')\n" +
            "    compile('commons-io:commons-io:2.5')\n" +
            "    compile('org.apache.commons:commons-lang3:3.5')\n" +
            "    compile('org.apache.commons:commons-collections4:4.1')\n" +
            "    compile('cglib:cglib:3.2.5')\n" +
            "    compile('net.java.dev.jna:jna:5.8.0')\n" +
            "    compile('org.apache.calcite:calcite-core:1.26.0')\n" +
            "\n" +
            "    // Test\n" +
            "    testCompile('org.mockito:mockito-core:2.13.0')\n" +
            "    testCompile('org.springframework:spring-test')\n" +
            "    testCompile('org.springframework.boot:spring-boot-starter-test')\n" +
            "\n" +
            "    // string-similarity\n" +
            "    compile('info.debatty:java-string-similarity:0.24')\n" +
            "\n" +
            "    compile('com.jayway.jsonpath:json-path')\n" +
            "\n" +
            "    compile('com.qt.qt-common:cron-lib:1.0.0')\n" +
            "\n" +
            "}";
 
    @Test
    public void tsetHighlight() {
        String regex = "org\\.apache";
        List<MatchedFileContent> matched = HighLightUtil.highlight(content, regex);
        Assert.assertTrue(matched.size() > 0);
    }
 
    @Test
    public void testHighlightBase64() {
        String content = "MG1laW5hMiAxbWVpbmEyCjBtZWluYTIgMW1laW5hMgo=";
        String regex = "meina2";
        List<MatchedFileContent> matchedFileContents = HighLightUtil.highlightBase64(content, regex);
        Assert.assertEquals(1, matchedFileContents.size());
        Assert.assertEquals("[MatchedFileContent(startLine=1, partContent=0<qthighlight--meina2--qthighlight> 1<qthighlight--meina2--qthighlight>\n" +
                "0<qthighlight--meina2--qthighlight> 1<qthighlight--meina2--qthighlight>\n" +
                ")]", matchedFileContents.toString());
    }
 
    @Test
    public void testHighLight2() {
        String content = "customdir2  1\n" +
                "customdir2  2\n" +
                "customdir2  3\n" +
                "customdir2  4\n" +
                "customdir2  5\n" +
                "customdir2  6\n" +
                "customdir2  7 customdir2  7 customdir2  7 customdir2  7\n" +
                "customdir2  8 customdir2  8 customdir2  8 customdir2  8";
        String regex = "customdir2";
        List<MatchedFileContent> matchedFileContents = HighLightUtil.highlight(content, regex);
        Assert.assertEquals(2, matchedFileContents.size());
    }
 
    @Test
    public void testHighLight3() {
        String content = "customdir2  1\n" +
                "customdir2  2\n" +
                "customdir3  3\n" +
                "customdir5  4\n" +
                "customdir6  5\n" +
                "customdir9  6\n" +
                "customdir10  7 customdir8  7 customdird  7 customdiro  7\n" +
                "customdir2  8 customdir2  8 customdir2  8 customdir2  8";
        String regex = "customdir2";
        List<MatchedFileContent> matchedFileContents = HighLightUtil.highlight(content, regex);
        Assert.assertEquals(2, matchedFileContents.size());
    }
}


小结

本文讲解了如何运用“构造与使用分离”的思想,来重构和改进高亮展示命中文本内容的算法实现。 构造与使用分离,即是在构造的时候抽取所需的必要信息,而在使用的时候去构建所需要功能,而不是边构建边使用,将构建与使用耦合在一起,后续如果有需求变更,改动就会比较麻烦。

posted @ 2022-01-27 22:43  琴水玉  阅读(196)  评论(2编辑  收藏  举报