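Modifying the IK Analyzer Source in Elasticsearch to Update the Dictionary from MySQL
This post records how to modify the IK analyzer source code so that its dictionary can be updated from a MySQL database. I ran and verified every step myself. If you happen to come across this post, I hope it helps.
Anyone who has used the IK analyzer probably knows that it offers three ways to configure a hot-word dictionary:
- IK built-in dictionary
- IK external static dictionary
- IK remote dictionary
See IK's configuration file for the details; I won't go into them here. After reading the source code that periodically refreshes the remote dictionary, I take the same approach to periodically fetch the dictionary from a MySQL database.
1. Download the source
Get the tag that matches your installed Elasticsearch version from the official code repository (download link). I chose version 7.9.3.
The class we care about most is org.wltea.analyzer.dic.Dictionary, which is IK's dictionary management class.
2. Add the MySQL driver dependency, the configuration, and the driver loading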
(1) Add the MySQL driver dependency to the pom file of the source, as shown below.
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.13</version>
</dependency>
(2) Add a database connection configuration file, jdbc-mysql-dict.properties, to the config directory, with the following contents:
jdbc.url=jdbc:mysql://127.0.0.1:3306/stop_word?useUnicode=true&characterEncoding=UTF-8&characterSetResults=UTF-8
jdbc.user=root
jdbc.password=123456
jdbc.reload.sql=select word from hot_words
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords
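To illustrate how these keys are read, here is a minimal sketch of loading and checking the file. The class name JdbcDictConfigSketch and the fail-fast check for missing keys are assumptions made for illustration only; they are not part of the IK source.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper: loads the JDBC settings and checks that every key is present.
class JdbcDictConfigSketch {
    // Keys taken from the properties file shown above.
    static final String[] REQUIRED_KEYS = {
        "jdbc.url", "jdbc.user", "jdbc.password",
        "jdbc.reload.sql", "jdbc.reload.stopword.sql"
    };

    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        // try-with-resources closes the stream even if load() fails
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        for (String key : REQUIRED_KEYS) {
            if (props.getProperty(key) == null) {
                throw new IllegalStateException("missing key: " + key);
            }
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to jdbc-mysql-dict.properties
        Properties props = load(Paths.get(args[0]));
        System.out.println(props.getProperty("jdbc.reload.sql"));
    }
}
```

Failing early on a missing key makes a typo in the file show up at load time rather than as a null SQL string later.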
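(3) Add code to the Dictionary class to load the MySQL driver
// Imports to add at the top of Dictionary.java, if not already present
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

private static Properties prop = new Properties();

static {
    try {
        Class.forName("com.mysql.cj.jdbc.Driver");
    } catch (ClassNotFoundException e) {
        logger.error("Failed to load the MySQL driver:", e);
    }
}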
(4) Create the database stop_word and add two tables, hot_words and hot_stopwords
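The exact table definitions are not shown here, so the following is only a sketch of what they might look like. The database name, the table names, and the word / stopword column names come from the JDBC URL and SQL in the properties file; the id column, the VARCHAR(64) length, the utf8mb4 charset, and the class name DictSchemaSketch are assumptions.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper holding the DDL for the dictionary database.
class DictSchemaSketch {
    static List<String> ddlStatements() {
        return Arrays.asList(
            "CREATE DATABASE IF NOT EXISTS stop_word DEFAULT CHARACTER SET utf8mb4",
            // Column types below are assumed; only the column name "word" is given by the SQL config
            "CREATE TABLE IF NOT EXISTS stop_word.hot_words ("
                + "id BIGINT AUTO_INCREMENT PRIMARY KEY, "
                + "word VARCHAR(64) NOT NULL"
                + ") DEFAULT CHARSET = utf8mb4",
            // "stopword" is selected "as word" by jdbc.reload.stopword.sql
            "CREATE TABLE IF NOT EXISTS stop_word.hot_stopwords ("
                + "id BIGINT AUTO_INCREMENT PRIMARY KEY, "
                + "stopword VARCHAR(64) NOT NULL"
                + ") DEFAULT CHARSET = utf8mb4"
        );
    }

    // Runs the DDL on an already opened MySQL connection.
    static void createTables(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            for (String ddl : ddlStatements()) {
                stmt.executeUpdate(ddl);
            }
        }
    }
}
```

The same statements can of course be run directly in a MySQL client instead of through JDBC.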
3. Add methods to the Dictionary class to update the dictionaries
Update the main dictionary
/**
 * Loads hot words from MySQL into the main dictionary.
 */
private void loadMySQLExtDict() {
    try {
        // Read the JDBC settings from the config file created in step 2
        Path file = PathUtils.get(getDictRoot(), "jdbc-mysql-dict.properties");
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            prop.load(in);
        }
        logger.info("jdbc-mysql-dict.properties");
        for (Object key : prop.keySet()) {
            logger.info(key + "=" + prop.getProperty(String.valueOf(key)));
        }
        String sql = prop.getProperty("jdbc.reload.sql");
        logger.info("query hot word dict from mysql, " + sql + "......");
        // try-with-resources closes the ResultSet, Statement and Connection in reverse order
        try (Connection conn = DriverManager.getConnection(
                 prop.getProperty("jdbc.url"),
                 prop.getProperty("jdbc.user"),
                 prop.getProperty("jdbc.password"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                String theWord = rs.getString("word");
                // Skip NULL or blank rows instead of aborting the whole reload
                if (theWord == null || theWord.trim().isEmpty()) {
                    continue;
                }
                logger.info("hot word from mysql: " + theWord);
                _MainDict.fillSegment(theWord.trim().toCharArray());
            }
        }
    } catch (Exception e) {
        logger.error("error", e);
    }
}
Update the stop words
/**
 * Loads stop words from MySQL into the stop-word dictionary.
 */
private void loadMySQLStopwordDict() {
    try {
        // Read the JDBC settings from the config file created in step 2
        Path file = PathUtils.get(getDictRoot(), "jdbc-mysql-dict.properties");
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            prop.load(in);
        }
        logger.info("jdbc-mysql-dict.properties");
        for (Object key : prop.keySet()) {
            logger.info(key + "=" + prop.getProperty(String.valueOf(key)));
        }
        String sql = prop.getProperty("jdbc.reload.stopword.sql");
        logger.info("query hot stopword dict from mysql, " + sql + "......");
        // try-with-resources closes the ResultSet, Statement and Connection in reverse order
        try (Connection conn = DriverManager.getConnection(
                 prop.getProperty("jdbc.url"),
                 prop.getProperty("jdbc.user"),
                 prop.getProperty("jdbc.password"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                String theWord = rs.getString("word");
                // Skip NULL or blank rows instead of aborting the whole reload
                if (theWord == null || theWord.trim().isEmpty()) {
                    continue;
                }
                logger.info("hot stopword from mysql: " + theWord);
                _StopWords.fillSegment(theWord.trim().toCharArray());
            }
        }
    } catch (Exception e) {
        logger.error("error", e);
    }
}
Expose a public method:
public void reLoadSQLDict() {
    this.loadMySQLExtDict();
    this.loadMySQLStopwordDict();
}
Add a Runnable class, MySQLDictReloadThread, that refreshes the dictionaries:
package org.wltea.analyzer.dic;

import org.apache.logging.log4j.Logger;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

/**
 * @description: Task that reloads the hot-word and stop-word dictionaries from MySQL.
 * @date: 2024/11/29
 **/
public class MySQLDictReloadThread implements Runnable {
    private static final Logger logger = ESPluginLoggerFactory.getLogger(MySQLDictReloadThread.class.getName());

    @Override
    public void run() {
        logger.info("reloading hot_word and stop_word dict from mysql");
        Dictionary.getSingleton().reLoadSQLDict();
    }
}
Following the pattern already used in the source, add the scheduled task:
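The scheduling code itself is not reproduced here. As a minimal sketch of the idea, the snippet below runs a task at a fixed rate with the JDK's ScheduledExecutorService, one standard way to implement this kind of periodic job. The class name ReloadSchedulingSketch, the stand-in task, and the delay values are assumptions for illustration; in the plugin the task would be a new MySQLDictReloadThread(), registered where Dictionary is initialized.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo of fixed-rate scheduling; not the actual Dictionary code.
class ReloadSchedulingSketch {
    // Starts a single-threaded scheduler and returns it so the caller can shut it down.
    static ScheduledExecutorService schedule(Runnable task, long initialDelayMs, long periodMs) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        pool.scheduleAtFixedRate(task, initialDelayMs, periodMs, TimeUnit.MILLISECONDS);
        return pool;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch threeRuns = new CountDownLatch(3);
        // Stand-in for new MySQLDictReloadThread()
        Runnable reload = () -> {
            System.out.println("reloading dictionaries from mysql");
            threeRuns.countDown();
        };
        // Short delays keep the demo quick; a real reload would use seconds or minutes
        ScheduledExecutorService pool = schedule(reload, 100, 200);
        threeRuns.await(5, TimeUnit.SECONDS);
        pool.shutdownNow();
    }
}
```

Note that scheduleAtFixedRate stops rescheduling a task once one of its runs throws, so the reload methods need to keep catching their own exceptions, as they do above.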
4. Repackage
Following the IK source, modify src/main/assemblies/plugin.xml so that the MySQL driver is included in the package:
<dependencySet>
    <outputDirectory/>
    <useProjectArtifact>true</useProjectArtifact>
    <useTransitiveFiltering>true</useTransitiveFiltering>
    <includes>
        <include>mysql:mysql-connector-java</include>
    </includes>
</dependencySet>
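Next, upload the rebuilt IK plugin to the Elasticsearch plugins directory (see the article "Installing the IK analyzer in Elasticsearch"), then restart ES.
I ran into a problem here: after startup, the log reported an insufficient-permissions error. This happens because ES enables the Java Security Manager, so the fix is to grant the required permissions. At first I edited /elasticsearch-7.9.3/jdk/conf/security/java.policy under the ES installation directory, but that had no effect. Checking the ES startup script and the log output made it clear that ES was using the machine's own Java environment: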
[2024-12-02T15:31:49,913][INFO ][o.e.n.Node ] [node-1] JVM home [/usr/local/java/jdk1.8.0_152/jre]
So add all the required permissions to /usr/local/java/jdk1.8.0_152/jre/lib/security/java.policy instead:
permission java.lang.reflect.ReflectPermission "suppressAccessChecks";
permission java.io.FilePermission "<<ALL FILES>>", "read,write,delete";
permission java.util.PropertyPermission "*", "read,write";
permission javax.security.auth.AuthPermission "getSubject";
permission javax.security.auth.AuthPermission "modifyPrincipals";
permission javax.security.auth.AuthPermission "getLoginConfiguration";
permission java.lang.RuntimePermission "createClassLoader";
permission java.lang.RuntimePermission "getClassLoader";
permission java.lang.RuntimePermission "setContextClassLoader";
permission java.lang.RuntimePermission "accessClassInPackage.sun.misc";
permission java.lang.RuntimePermission "accessClassInPackage.sun.nio.ch";
permission java.lang.RuntimePermission "accessDeclaredMembers";
permission java.lang.RuntimePermission "loadLibrary.jaas_unix";
permission java.lang.RuntimePermission "shutdownHooks";
permission java.lang.RuntimePermission "createSecurityManager";
permission java.lang.RuntimePermission "closeClassLoader";
permission java.net.SocketPermission "127.0.0.1:3306","connect,resolve";
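The last entry is the one that allows the connection to the database server.
After restarting the ES service once more, it runs normally.
5. Verify the dictionary
First, look at the result before the words are added to the dictionary.
Insert two words, 恰斯卡 and 纳塔, into the database; the log shows that they have been loaded:
Now look at the result again: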
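One way to run this check is Elasticsearch's _analyze API. The sketch below builds and sends such a request with the JDK's HttpURLConnection. The base URL, the choice of the ik_max_word analyzer, and the class name AnalyzeCheckSketch are assumptions; adjust them to your own setup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper for calling the _analyze API; not part of the plugin.
class AnalyzeCheckSketch {
    // Builds the JSON request body for POST /_analyze.
    static String analyzeBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    // Minimal escaping: handles backslashes and double quotes only.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Sends the request and returns the raw JSON response.
    static String analyze(String baseUrl, String analyzer, String text) throws IOException {
        HttpURLConnection http = (HttpURLConnection) new URL(baseUrl + "/_analyze").openConnection();
        http.setRequestMethod("POST");
        http.setRequestProperty("Content-Type", "application/json");
        http.setDoOutput(true);
        try (OutputStream out = http.getOutputStream()) {
            out.write(analyzeBody(analyzer, text).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = http.getInputStream();
             Scanner sc = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return sc.hasNext() ? sc.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the text to analyze; the URL and analyzer are assumed defaults
        System.out.println(analyze("http://127.0.0.1:9200", "ik_max_word", args[0]));
    }
}
```

Running it with a sentence that contains the new words, before and after inserting them into hot_words, shows whether they come back as single tokens.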
The feature basically works. This is only a demo; in real use it can be extended further from here. That's all for this post. Thanks for reading.