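Elasticsearch Analyzers, the IK Analyzer, and IK Permission Issues
Analyzer Concepts
Analysis and Analyzer
Analysis
: Text analysis is the process of converting full text into a series of words (terms/tokens); it is also called tokenization.
Analysis is carried out by an Analyzer.
When a document is indexed, an inverted index may be created for each field (the mapping can mark a field as not indexed).
Building the inverted index means using the Analyzer to split the document into individual terms, each of which points to the set of documents that contain it.
At query time, Elasticsearch decides, based on the query type, whether to analyze the query text, then matches the resulting terms against the inverted index and ranks the matching documents by relevance.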
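To make the term-to-documents mapping concrete, here is a minimal sketch of an inverted index in plain Java. It is not how Elasticsearch or Lucene implement it: `ToyInvertedIndex` and its `analyze` helper are hypothetical names, the "analyzer" only lowercases and splits on non-letter/digit characters, and the search returns documents containing all query terms without any relevance scoring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: maps each term to the sorted set of document ids that contain it. */
final class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Hypothetical stand-in for an analyzer: lowercase, then split on non-letter/digit characters. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Index time: analyze the text and record the document id under every term. */
    void index(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Query time: analyze the query the same way and intersect the posting lists. */
    TreeSet<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Because the query goes through the same `analyze` step as the documents, a search for `BROWN` finds documents indexed with `brown`.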
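Components of an Analyzer
An analyzer is built from three kinds of building blocks: character filters, tokenizers, and token filters.
Character filters
Preprocess the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you).
Tokenizers
Split the text into tokens. English can be split into words on whitespace; Chinese segmentation is more complex and can use machine-learning algorithms.
Token filters
Post-process the tokens produced by the tokenizer: change case (for example, lowercase "Quick"), remove tokens (for example, stop words such as "a", "and", "the"), or add tokens (for example, synonyms such as "jump" and "leap").
Order and Count of the Three Components
Order
: Character Filters ---> Tokenizer ---> Token Filters
Count
: analyzer = CharFilters (zero or more) + Tokenizer (exactly one) + TokenFilters (zero or more)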
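The order and counts above can be sketched as a small pipeline in plain Java. This is an illustration only: `ToyAnalyzer` and the filters inside `example()` are hypothetical stand-ins written for this article, not Elasticsearch classes, and the regular expressions are much cruder than the real `html_strip` character filter or `standard` tokenizer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Toy analyzer: char filters (0..n) -> tokenizer (exactly 1) -> token filters (0..n). */
final class ToyAnalyzer {
    private final List<UnaryOperator<String>> charFilters;
    private final Function<String, List<String>> tokenizer;
    private final List<UnaryOperator<List<String>>> tokenFilters;

    ToyAnalyzer(List<UnaryOperator<String>> charFilters,
                Function<String, List<String>> tokenizer,
                List<UnaryOperator<List<String>>> tokenFilters) {
        this.charFilters = charFilters;
        this.tokenizer = tokenizer;
        this.tokenFilters = tokenFilters;
    }

    List<String> analyze(String text) {
        // 1. Character filters rewrite the raw text.
        for (UnaryOperator<String> f : charFilters) {
            text = f.apply(text);
        }
        // 2. The single tokenizer splits the text into tokens.
        List<String> tokens = tokenizer.apply(text);
        // 3. Token filters modify, remove, or add tokens.
        for (UnaryOperator<List<String>> f : tokenFilters) {
            tokens = f.apply(tokens);
        }
        return tokens;
    }

    /** Example wiring: strip tags, map & to "and", split on whitespace, lowercase, drop stop words. */
    static ToyAnalyzer example() {
        UnaryOperator<String> stripHtml = s -> s.replaceAll("<[^>]+>", "");
        UnaryOperator<String> ampToAnd = s -> s.replace("&", " and ");
        Function<String, List<String>> whitespace = s -> Arrays.stream(s.trim().split("\\s+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
        UnaryOperator<List<String>> lowercase = ts -> ts.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "and", "the"));
        UnaryOperator<List<String>> removeStops = ts -> ts.stream()
                .filter(t -> !stopWords.contains(t))
                .collect(Collectors.toList());
        return new ToyAnalyzer(Arrays.asList(stripHtml, ampToAnd), whitespace,
                Arrays.asList(lowercase, removeStops));
    }
}
```

With this wiring, `ToyAnalyzer.example().analyze("<span>The Quick</span> I&you")` yields the tokens quick, i, you. The "and" introduced by the character filter is removed again by the stop-word filter, which shows why the order of the stages matters.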
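Built-in Elasticsearch Analyzers
- Standard Analyzer - the default analyzer; splits on word boundaries and lowercases
- Simple Analyzer - splits on non-letter characters (symbols are dropped) and lowercases
- Stop Analyzer - lowercases and removes stop words (the, a, is)
- Whitespace Analyzer - splits on whitespace; does not lowercase
- Keyword Analyzer - no tokenization; the input is emitted as a single term
- Pattern Analyzer - splits on a regular expression; the default is \W+ (non-word characters)
- Language - analyzers for more than 30 common languages
- Custom Analyzer - a user-defined analyzer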
Setting Analyzers When Creating an Index
# Define a custom analyzer named std_folded in settings
# In mappings, the title field uses the custom analyzer and the content field uses the whitespace analyzer
PUT new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "std_folded"
      },
      "content": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}
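Commonly Used Built-in Analyzers
Standard Analyzer (default)
standard is the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.
POST _analyze
{
  "analyzer": "standard",
  "text": "Like X 国庆放假的"
}
Configuration
The standard analyzer accepts the following parameters:
max_token_length
: the maximum token length; defaults to 255
stopwords
: a predefined stop word list such as _english_, or an array of stop words; defaults to _none_
stopwords_path
: the path to a file containing stop words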
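# Define my_english_analyzer on top of the standard analyzer ("type": "standard")
# "max_token_length": 5 limits tokens to at most 5 characters
# "stopwords": "_english_" enables the predefined English stop word list
PUT new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}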
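The combined effect of max_token_length and stopwords can be approximated in a few lines of Java. This is only a sketch under simplifying assumptions: `MaxTokenLengthSketch` is a hypothetical name, the tokenizer splits on non-letter/digit characters instead of implementing Unicode text segmentation, and `STOP` is a small hand-picked subset rather than the full _english_ list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class MaxTokenLengthSketch {
    // Assumption: a tiny subset of English stop words, for illustration only.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("a", "an", "and", "is", "the"));

    static List<String> analyze(String text, int maxTokenLength) {
        List<String> out = new ArrayList<>();
        for (String word : text.split("[^\\p{L}\\p{N}]+")) {
            // A token longer than maxTokenLength is cut into pieces of at most that length.
            for (int i = 0; i < word.length(); i += maxTokenLength) {
                int end = Math.min(word.length(), i + maxTokenLength);
                String piece = word.substring(i, end).toLowerCase(Locale.ROOT);
                // Stop words are dropped after lowercasing.
                if (!STOP.contains(piece)) {
                    out.add(piece);
                }
            }
        }
        return out;
    }
}
```

For example, `MaxTokenLengthSketch.analyze("The QUICK foxes jumped", 5)` returns quick, foxes, jumpe, d: "the" is removed as a stop word and the six-letter "jumped" is split after five characters.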
Simple Analyzer
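The simple analyzer splits the text into terms whenever it encounters a character that is not a letter, and lowercases every term.
POST _analyze
{
  "analyzer": "simple",
  "text": "Like X 国庆放假 的"
}
Whitespace Analyzer
Splits on whitespace; does not lowercase.
POST _analyze
{
  "analyzer": "whitespace",
  "text": "Like X 国庆放假 的"
}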
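The difference between the two analyzers is easy to see in a small Java sketch. The class and method names here are hypothetical, and the code only mimics the documented behaviour (split on non-letters and lowercase versus split on whitespace only); it does not call Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleVsWhitespace {
    /** Whitespace-style: split on whitespace, keep case and punctuation. */
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Simple-style: a token is a maximal run of letters; every token is lowercased. */
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}
```

For the sample text "Like X 国庆放假 的", `simple` returns like, x, 国庆放假, 的, while `whitespace` keeps the original case: Like, X, 国庆放假, 的. Chinese characters count as letters, so neither approach segments 国庆放假 into words, which is the gap the IK analyzer fills.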
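Chinese Analyzer: IK
Installing the IK Analyzer
- The open-source IK analyzer is on GitHub: https://github.com/medcl/elasticsearch-analysis-ik
Note
The IK analyzer version must match your installed ES version. I am running 7.12.1, so I picked the corresponding release on GitHub.
- Build the project in an IDE, then unzip the archive from the build output directory target/releases into the plugins directory of the ES installation (you can create an ik folder there).
- Restart ES
Note
ES must be restarted after the plugin is installed for it to take effect.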
Using IK
IK provides two segmentation granularities:
ik_smart
: produces the coarsest-grained segmentation
ik_max_word
: produces the finest-grained segmentation
ik_smart
GET /_analyze
{
  "text": "中华人民共和国国徽",
  "analyzer": "ik_smart"
}
ik_max_word
GET /_analyze
{
  "text": "中华人民共和国国徽",
  "analyzer": "ik_max_word"
}
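To build some intuition for coarse versus fine granularity, the following toy segmenter works over a tiny in-memory dictionary. It is not IK's algorithm (IK combines dictionary matching with its own disambiguation logic); `ToySegmenter`, `coarse`, and `fine` are hypothetical names, and the sketch only shows why a finer mode emits overlapping words that a coarse mode skips.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ToySegmenter {
    private final Set<String> dict;
    private final int maxLen;

    ToySegmenter(Collection<String> words) {
        dict = new HashSet<>(words);
        int longest = 1;
        for (String w : words) {
            longest = Math.max(longest, w.length());
        }
        maxLen = longest;
    }

    /** Coarse: greedy forward longest match; unmatched characters become single-character tokens. */
    List<String> coarse(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1;
            for (int j = Math.min(text.length(), i + maxLen); j > i + 1; j--) {
                if (dict.contains(text.substring(i, j))) {
                    end = j;
                    break;
                }
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    /** Fine: every dictionary word found at any start position, longest first per position. */
    List<String> fine(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = Math.min(text.length(), i + maxLen); j > i; j--) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }
}
```

With a dictionary containing 中华人民共和国, 中华, 人民, 共和国, and 国徽, `coarse("中华人民共和国国徽")` produces just 中华人民共和国 and 国徽, whereas `fine` additionally emits the shorter words nested inside the long one.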
Using MySQL as a Hot-Reload Dictionary for IK (IK and ES both at 7.12.1)
Modifying the Source Code
- Add a HotDictReloadThread class in the org/wltea/analyzer/dic directory
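package org.wltea.analyzer.dic;

import org.apache.logging.log4j.Logger;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

/**
 * Keeps reloading the IK dictionaries so that words added to MySQL are picked up.
 */
public class HotDictReloadThread {
    private static final Logger log = ESPluginLoggerFactory.getLogger(HotDictReloadThread.class.getName());

    public void initial() {
        // Loops forever. Each reLoadMainDict() call ends up in loadMySQLExtDict(),
        // which sleeps for jdbc.reload.interval seconds, so that sleep paces this loop.
        while (true) {
            log.info("Calling HotDictReloadThread...");
            Dictionary.getSingleton().reLoadMainDict();
        }
    }
}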
- Add the following code to the Dictionary class in the org/wltea/analyzer/dic directory
/**
* IK 中文分词 版本 5.0
* IK Analyzer release 5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* 源代码由林良益(linliangyi2005@gmail.com)提供
* 版权声明 2012,乌龙茶工作室
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*
*/
package org.wltea.analyzer.dic;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.Files;
import java.nio.file.FileVisitResult;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.security.AccessController;
import java.security.PrivilegedAction;
import java.sql.*;
import java.util.*;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.elasticsearch.SpecialPermission;
import org.elasticsearch.common.io.PathUtils;
import org.elasticsearch.plugin.analysis.ik.AnalysisIkPlugin;
import org.wltea.analyzer.cfg.Configuration;
import org.apache.logging.log4j.Logger;
import org.wltea.analyzer.help.ESPluginLoggerFactory;
/**
* Dictionary manager class (singleton pattern)
*/
public class Dictionary {
/*
* Singleton dictionary instance
*/
private static Dictionary singleton;
private DictSegment _MainDict;
private DictSegment _QuantifierDict;
private DictSegment _StopWords;
/**
* Configuration object
*/
private Configuration configuration;
private static final Logger logger = ESPluginLoggerFactory.getLogger(Dictionary.class.getName());
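// Two threads: the hot-reload loop started in initial() never returns and permanently occupies one of them;
// the other stays free for the remote-dictionary Monitor tasks.
private static ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);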
private static final String PATH_DIC_MAIN = "main.dic";
private static final String PATH_DIC_SURNAME = "surname.dic";
private static final String PATH_DIC_QUANTIFIER = "quantifier.dic";
private static final String PATH_DIC_SUFFIX = "suffix.dic";
private static final String PATH_DIC_PREP = "preposition.dic";
private static final String PATH_DIC_STOP = "stopword.dic";
private final static String FILE_NAME = "IKAnalyzer.cfg.xml";
private final static String EXT_DICT = "ext_dict";
private final static String REMOTE_EXT_DICT = "remote_ext_dict";
private final static String EXT_STOP = "ext_stopwords";
private final static String REMOTE_EXT_STOP = "remote_ext_stopwords";
private Path conf_dir;
private Properties props;
private Dictionary(Configuration cfg) {
this.configuration = cfg;
this.props = new Properties();
this.conf_dir = cfg.getEnvironment().configFile().resolve(AnalysisIkPlugin.PLUGIN_NAME);
Path configFile = conf_dir.resolve(FILE_NAME);
InputStream input = null;
try {
logger.info("try load config from {}", configFile);
input = new FileInputStream(configFile.toFile());
} catch (FileNotFoundException e) {
conf_dir = cfg.getConfigInPluginDir();
configFile = conf_dir.resolve(FILE_NAME);
try {
logger.info("try load config from {}", configFile);
input = new FileInputStream(configFile.toFile());
} catch (FileNotFoundException ex) {
// We should report origin exception
logger.error("ik-analyzer", e);
}
}
if (input != null) {
try {
props.loadFromXML(input);
} catch (IOException e) {
logger.error("ik-analyzer", e);
}
}
}
private String getProperty(String key){
if(props!=null){
return props.getProperty(key);
}
return null;
}
/**
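* Dictionary initialization. IK Analyzer initializes its dictionaries through static methods of the Dictionary class,
* so they are loaded only when the Dictionary class is first used, which slows down the first analysis request. This method lets the application load the dictionaries during startup instead.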
*
* @return Dictionary
*/
public static synchronized void initial(Configuration cfg) {
if (singleton == null) {
synchronized (Dictionary.class) {
if (singleton == null) {
singleton = new Dictionary(cfg);
singleton.loadMainDict();
singleton.loadSurnameDict();
singleton.loadQuantifierDict();
singleton.loadSuffixDict();
singleton.loadPrepDict();
singleton.loadStopWordDict();
// Once the dictionary instance is initialized, start a task that hot-reloads the dictionaries
pool.execute(() -> new HotDictReloadThread().initial());
if(cfg.isEnableRemoteDict()){
// Start the monitoring tasks
for (String location : singleton.getRemoteExtDictionarys()) {
// 10 is the initial delay and 60 is the interval, both in seconds; adjust as needed
pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
}
for (String location : singleton.getRemoteExtStopWordDictionarys()) {
pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
}
}
}
}
}
}
private void walkFileTree(List<String> files, Path path) {
if (Files.isRegularFile(path)) {
files.add(path.toString());
} else if (Files.isDirectory(path)) try {
Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
files.add(file.toString());
return FileVisitResult.CONTINUE;
}
@Override
public FileVisitResult visitFileFailed(Path file, IOException e) {
logger.error("[Ext Loading] listing files", e);
return FileVisitResult.CONTINUE;
}
});
} catch (IOException e) {
logger.error("[Ext Loading] listing files", e);
} else {
logger.warn("[Ext Loading] file not found: " + path);
}
}
private void loadDictFile(DictSegment dict, Path file, boolean critical, String name) {
try (InputStream is = new FileInputStream(file.toFile())) {
BufferedReader br = new BufferedReader(
new InputStreamReader(is, "UTF-8"), 512);
String word = br.readLine();
if (word != null) {
if (word.startsWith("\uFEFF"))
word = word.substring(1);
for (; word != null; word = br.readLine()) {
word = word.trim();
if (word.isEmpty()) continue;
dict.fillSegment(word.toCharArray());
}
}
} catch (FileNotFoundException e) {
logger.error("ik-analyzer: " + name + " not found", e);
if (critical) throw new RuntimeException("ik-analyzer: " + name + " not found!!!", e);
} catch (IOException e) {
logger.error("ik-analyzer: " + name + " loading failed", e);
}
}
private List<String> getExtDictionarys() {
List<String> extDictFiles = new ArrayList<String>(2);
String extDictCfg = getProperty(EXT_DICT);
if (extDictCfg != null) {
String[] filePaths = extDictCfg.split(";");
for (String filePath : filePaths) {
if (filePath != null && !"".equals(filePath.trim())) {
Path file = PathUtils.get(getDictRoot(), filePath.trim());
walkFileTree(extDictFiles, file);
}
}
}
return extDictFiles;
}
private List<String> getRemoteExtDictionarys() {
List<String> remoteExtDictFiles = new ArrayList<String>(2);
String remoteExtDictCfg = getProperty(REMOTE_EXT_DICT);
if (remoteExtDictCfg != null) {
String[] filePaths = remoteExtDictCfg.split(";");
for (String filePath : filePaths) {
if (filePath != null && !"".equals(filePath.trim())) {
remoteExtDictFiles.add(filePath);
}
}
}
return remoteExtDictFiles;
}
private List<String> getExtStopWordDictionarys() {
List<String> extStopWordDictFiles = new ArrayList<String>(2);
String extStopWordDictCfg = getProperty(EXT_STOP);
if (extStopWordDictCfg != null) {
String[] filePaths = extStopWordDictCfg.split(";");
for (String filePath : filePaths) {
if (filePath != null && !"".equals(filePath.trim())) {
Path file = PathUtils.get(getDictRoot(), filePath.trim());
walkFileTree(extStopWordDictFiles, file);
}
}
}
return extStopWordDictFiles;
}
private List<String> getRemoteExtStopWordDictionarys() {
List<String> remoteExtStopWordDictFiles = new ArrayList<String>(2);
String remoteExtStopWordDictCfg = getProperty(REMOTE_EXT_STOP);
if (remoteExtStopWordDictCfg != null) {
String[] filePaths = remoteExtStopWordDictCfg.split(";");
for (String filePath : filePaths) {
if (filePath != null && !"".equals(filePath.trim())) {
remoteExtStopWordDictFiles.add(filePath);
}
}
}
return remoteExtStopWordDictFiles;
}
private String getDictRoot() {
return conf_dir.toAbsolutePath().toString();
}
/**
* Get the singleton dictionary instance
*
* @return Dictionary the singleton object
*/
public static Dictionary getSingleton() {
if (singleton == null) {
throw new IllegalStateException("ik dict has not been initialized yet, please call initial method first.");
}
return singleton;
}
/**
* Load a batch of new words
*
* @param words
* Collection<String> list of words
*/
public void addWords(Collection<String> words) {
if (words != null) {
for (String word : words) {
if (word != null) {
// Load the word into the in-memory main dictionary
singleton._MainDict.fillSegment(word.trim().toCharArray());
}
}
}
}
/**
* Remove (disable) a batch of words
*/
public void disableWords(Collection<String> words) {
if (words != null) {
for (String word : words) {
if (word != null) {
// Disable the word
singleton._MainDict.disableSegment(word.trim().toCharArray());
}
}
}
}
/**
* Match against the main dictionary
*
* @return Hit description of the match result
*/
public Hit matchInMainDict(char[] charArray) {
return singleton._MainDict.match(charArray);
}
/**
* Match against the main dictionary
*
* @return Hit description of the match result
*/
public Hit matchInMainDict(char[] charArray, int begin, int length) {
return singleton._MainDict.match(charArray, begin, length);
}
/**
* Match against the quantifier dictionary
*
* @return Hit description of the match result
*/
public Hit matchInQuantifierDict(char[] charArray, int begin, int length) {
return singleton._QuantifierDict.match(charArray, begin, length);
}
/**
* Take the DictSegment directly from an already matched Hit and continue matching downward
*
* @return Hit
*/
public Hit matchWithHit(char[] charArray, int currentIndex, Hit matchedHit) {
DictSegment ds = matchedHit.getMatchedDictSegment();
return ds.match(charArray, currentIndex, 1, matchedHit);
}
/**
* Check whether the text is a stop word
*
* @return boolean
*/
public boolean isStopWord(char[] charArray, int begin, int length) {
return singleton._StopWords.match(charArray, begin, length).isMatch();
}
/**
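* Load the main dictionary and the extension dictionaries
*/
private void loadMainDict() {
// Create the main dictionary instance
_MainDict = new DictSegment((char) 0);
// Read the main dictionary file
Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);
loadDictFile(_MainDict, file, false, "Main Dict");
// Load the extension dictionaries
this.loadExtDict();
// Load the remote custom dictionaries
this.loadRemoteExtDict();
// Load the hot-reload dictionary from MySQL
this.loadMySQLExtDict();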
}
/**
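* Load the user-configured extension dictionaries into the main dictionary
*/
private void loadExtDict() {
// Read the extension dictionary configuration
List<String> extDictFiles = getExtDictionarys();
if (extDictFiles != null) {
for (String extDictName : extDictFiles) {
// Read the extension dictionary file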
logger.info("[Dict Loading] " + extDictName);
Path file = PathUtils.get(extDictName);
loadDictFile(_MainDict, file, false, "Extra Dict");
}
}
}
/**
* Load the remote extension dictionaries into the main dictionary
*/
private void loadRemoteExtDict() {
List<String> remoteExtDictFiles = getRemoteExtDictionarys();
for (String location : remoteExtDictFiles) {
logger.info("[Dict Loading] " + location);
List<String> lists = getRemoteWords(location);
// Skip this location if the remote dictionary cannot be found
if (lists == null) {
logger.error("[Dict Loading] " + location + " load failed");
continue;
}
for (String theWord : lists) {
if (theWord != null && !"".equals(theWord.trim())) {
// Load the remote word into the in-memory main dictionary
logger.info(theWord);
_MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
}
}
}
}
/**
* Load the hot-reload extension dictionary from MySQL
*/
private void loadMySQLExtDict() {
Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
// try-with-resources closes the properties file; the original stream was never closed
try (InputStream in = new FileInputStream(file.toFile())) {
props.load(in);
logger.info("[==========]jdbc-reload.properties");
for(Object key : props.keySet()) {
// Do not write the database password to the log
if (!"jdbc.password".equals(key)) {
logger.info("[==========]" + key + "=" + props.getProperty(String.valueOf(key)));
}
}
logger.info("[==========]query hot dict from mysql, " + props.getProperty("jdbc.reload.sql") + "......");
Class.forName(props.getProperty("jdbc.className"));
// Connection, Statement and ResultSet are closed automatically, in reverse order
try (Connection conn = DriverManager.getConnection(
props.getProperty("jdbc.url"),
props.getProperty("jdbc.user"),
props.getProperty("jdbc.password"));
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(props.getProperty("jdbc.reload.sql"))) {
while(rs.next()) {
String theWord = rs.getString("word");
logger.info("[==========]Loading MySQL custom IK extension word: " + theWord);
_MainDict.fillSegment(theWord.trim().toCharArray());
}
}
// Pause before returning; this sleep is what paces the endless loop in HotDictReloadThread
Thread.sleep(Integer.parseInt(props.getProperty("jdbc.reload.interval")) * 1000L);
} catch (Exception e) {
logger.error("error", e);
}
}
/**
* Load the hot-reload stop word dictionary from MySQL.
* Nothing in this class calls this method; call it from loadStopWordDict() to enable MySQL stop words.
*/
private void loadMySQLStopwordDict() {
Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
// try-with-resources closes the properties file; the original stream was never closed
try (InputStream in = new FileInputStream(file.toFile())) {
props.load(in);
logger.info("[==========]jdbc-reload.properties");
for(Object key : props.keySet()) {
// Do not write the database password to the log
if (!"jdbc.password".equals(key)) {
logger.info("[==========]" + key + "=" + props.getProperty(String.valueOf(key)));
}
}
logger.info("[==========]query hot stopword dict from mysql, " + props.getProperty("jdbc.reload.stopword.sql") + "......");
Class.forName(props.getProperty("jdbc.className"));
// Connection, Statement and ResultSet are closed automatically, in reverse order
try (Connection conn = DriverManager.getConnection(
props.getProperty("jdbc.url"),
props.getProperty("jdbc.user"),
props.getProperty("jdbc.password"));
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(props.getProperty("jdbc.reload.stopword.sql"))) {
while(rs.next()) {
String theWord = rs.getString("word");
logger.info("[==========]Loading MySQL custom IK stop word: " + theWord);
_StopWords.fillSegment(theWord.trim().toCharArray());
}
}
Thread.sleep(Integer.parseInt(props.getProperty("jdbc.reload.interval")) * 1000L);
} catch (Exception e) {
logger.error("error", e);
}
}
private static List<String> getRemoteWords(String location) {
SpecialPermission.check();
return AccessController.doPrivileged((PrivilegedAction<List<String>>) () -> {
return getRemoteWordsUnprivileged(location);
});
}
/**
* Download custom words from a remote server
*/
private static List<String> getRemoteWordsUnprivileged(String location) {
List<String> buffer = new ArrayList<String>();
RequestConfig rc = RequestConfig.custom().setConnectionRequestTimeout(10 * 1000).setConnectTimeout(10 * 1000)
.setSocketTimeout(60 * 1000).build();
CloseableHttpClient httpclient = HttpClients.createDefault();
CloseableHttpResponse response;
BufferedReader in;
HttpGet get = new HttpGet(location);
get.setConfig(rc);
try {
response = httpclient.execute(get);
if (response.getStatusLine().getStatusCode() == 200) {
String charset = "UTF-8";
// Determine the charset; defaults to UTF-8
HttpEntity entity = response.getEntity();
if(entity!=null){
Header contentType = entity.getContentType();
if(contentType!=null&&contentType.getValue()!=null){
String typeValue = contentType.getValue();
if(typeValue!=null&&typeValue.contains("charset=")){
charset = typeValue.substring(typeValue.lastIndexOf("=") + 1);
}
}
if (entity.getContentLength() > 0 || entity.isChunked()) {
in = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
String line;
while ((line = in.readLine()) != null) {
buffer.add(line);
}
in.close();
response.close();
return buffer;
}
}
}
response.close();
} catch (IllegalStateException | IOException e) {
// The location fills the {} placeholder; the exception goes last so its stack trace is logged
logger.error("getRemoteWords {} error", location, e);
}
return buffer;
}
/**
* Load the stop word dictionary, including user extensions
*/
private void loadStopWordDict() {
// Create the stop word dictionary instance
_StopWords = new DictSegment((char) 0);
// Read the main stop word file
Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_STOP);
loadDictFile(_StopWords, file, false, "Main Stopwords");
// Load the extension stop word dictionaries
List<String> extStopWordDictFiles = getExtStopWordDictionarys();
if (extStopWordDictFiles != null) {
for (String extStopWordDictName : extStopWordDictFiles) {
logger.info("[Dict Loading] " + extStopWordDictName);
// Read the extension stop word file
file = PathUtils.get(extStopWordDictName);
loadDictFile(_StopWords, file, false, "Extra Stopwords");
}
}
// Load the remote stop word dictionaries
List<String> remoteExtStopWordDictFiles = getRemoteExtStopWordDictionarys();
for (String location : remoteExtStopWordDictFiles) {
logger.info("[Dict Loading] " + location);
List<String> lists = getRemoteWords(location);
// Skip this location if the remote dictionary cannot be found
if (lists == null) {
logger.error("[Dict Loading] " + location + " load failed");
continue;
}
for (String theWord : lists) {
if (theWord != null && !"".equals(theWord.trim())) {
// Load the remote word into the in-memory stop word dictionary
logger.info(theWord);
_StopWords.fillSegment(theWord.trim().toLowerCase().toCharArray());
}
}
}
}
/**
* Load the quantifier dictionary
*/
private void loadQuantifierDict() {
// Create the quantifier dictionary instance
_QuantifierDict = new DictSegment((char) 0);
// Read the quantifier dictionary file
Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_QUANTIFIER);
loadDictFile(_QuantifierDict, file, false, "Quantifier");
}
private void loadSurnameDict() {
DictSegment _SurnameDict = new DictSegment((char) 0);
Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_SURNAME);
loadDictFile(_SurnameDict, file, true, "Surname");
}
private void loadSuffixDict() {
DictSegment _SuffixDict = new DictSegment((char) 0);
Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_SUFFIX);
loadDictFile(_SuffixDict, file, true, "Suffix");
}
private void loadPrepDict() {
DictSegment _PrepDict = new DictSegment((char) 0);
Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_PREP);
loadDictFile(_PrepDict, file, true, "Preposition");
}
void reLoadMainDict() {
logger.info("start to reload ik dict.");
// Load the dictionaries into a new instance to reduce the impact on the dictionary currently in use
Dictionary tmpDict = new Dictionary(configuration);
tmpDict.configuration = getSingleton().configuration;
tmpDict.loadMainDict();
tmpDict.loadStopWordDict();
_MainDict = tmpDict._MainDict;
_StopWords = tmpDict._StopWords;
logger.info("reload ik dict finished.");
}
}
- Add a database configuration file, jdbc-reload.properties, in the config directory
# JDBC driver class, read by the code above as jdbc.className. The value is an assumption: com.mysql.cj.jdbc.Driver fits MySQL Connector/J 8.x, com.mysql.jdbc.Driver fits 5.x
jdbc.className=com.mysql.cj.jdbc.Driver
# Database URL
jdbc.url=jdbc:mysql://192.168.232.128:3306/ik_test?serverTimezone=GMT&autoReconnect=true&useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useAffectedRows=true&useSSL=false
# Database user
jdbc.user=root
# Database password
jdbc.password=123456
# SQL that queries the extension dictionary
jdbc.reload.sql=select gel.lexicon_text as word from es_lexicon gel where gel.lexicon_type = 0 and gel.lexicon_status = 0 and gel.del_flag = 0 order by gel.lexicon_id desc
# SQL that queries the stop words
jdbc.reload.stopword.sql=select gel.lexicon_text as word from es_lexicon gel where gel.lexicon_type = 1 and gel.lexicon_status = 0 and gel.del_flag = 0 order by gel.lexicon_id desc
# Polling interval in seconds (query every 10 seconds)
jdbc.reload.interval=10
- Table creation statement
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;
-- ----------------------------
-- Table structure for es_lexicon
-- ----------------------------
DROP TABLE IF EXISTS `es_lexicon`;
CREATE TABLE `es_lexicon` (
  `lexicon_id` bigint(8) NOT NULL AUTO_INCREMENT COMMENT 'lexicon entry id',
  `lexicon_text` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'keyword of the entry',
  `lexicon_type` int(1) NOT NULL DEFAULT 0 COMMENT '0 = extension dictionary, 1 = stop word dictionary',
  `lexicon_status` int(1) NOT NULL DEFAULT 0 COMMENT 'entry status: 0 = active, 1 = suspended',
  `del_flag` int(1) NOT NULL DEFAULT 0 COMMENT 'deletion flag: 0 = active, 1 = deleted',
  `create_time` datetime(0) NOT NULL DEFAULT CURRENT_TIMESTAMP(0) COMMENT 'creation time',
  PRIMARY KEY (`lexicon_id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 2 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci COMMENT = 'ES remote extension dictionary table' ROW_FORMAT = DYNAMIC;
-- ----------------------------
-- Records of es_lexicon
-- ----------------------------
INSERT INTO `es_lexicon` VALUES (1, '大脸猫', 0, 0, 1, '2021-05-26 22:33:40');
SET FOREIGN_KEY_CHECKS = 1;
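Build the plugin with Maven by running mvn install
Unzip target/releases/elasticsearch-analysis-ik-7.12.1.zip into the (ES install path)/plugins/ik folder
Copy the matching MySQL driver JAR into the (ES install path)/plugins/ik folder
Restart ES
Result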
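The patch above paces its reloads with a Thread.sleep inside the load method and an endless loop on a pool thread. A common alternative, sketched below with hypothetical names, is to let a scheduler trigger each reload and to swap in a freshly built word set atomically. `HotWordReloader` is not part of IK, and the `Supplier` stands in for the JDBC query, so treat this as a design sketch rather than a drop-in replacement.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Periodically rebuilds a word set from a source and publishes it atomically. */
final class HotWordReloader implements AutoCloseable {
    private final AtomicReference<Set<String>> words =
            new AtomicReference<>(Collections.<String>emptySet());
    private final Supplier<Collection<String>> source; // assumption: wraps the database query
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "hot-word-reload");
                t.setDaemon(true); // do not keep the JVM alive just for reloading
                return t;
            });

    HotWordReloader(Supplier<Collection<String>> source) {
        this.source = source;
    }

    /** Builds a fresh snapshot and swaps it in; readers never see a half-loaded set. */
    void reloadOnce() {
        try {
            Set<String> fresh = new HashSet<>();
            for (String w : source.get()) {
                if (w != null && !w.trim().isEmpty()) {
                    fresh.add(w.trim());
                }
            }
            words.set(Collections.unmodifiableSet(fresh));
        } catch (RuntimeException e) {
            // Keep the previous snapshot; an uncaught exception would cancel the periodic task.
            System.err.println("hot word reload failed: " + e);
        }
    }

    /** Reloads immediately, then again intervalSeconds after each run completes. */
    void start(long intervalSeconds) {
        scheduler.scheduleWithFixedDelay(this::reloadOnce, 0, intervalSeconds, TimeUnit.SECONDS);
    }

    boolean contains(String word) {
        return words.get().contains(word);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With this shape the interval also applies after a failed query, and the load method itself no longer blocks its caller for the length of the polling interval.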
Permission Errors When Running the IK Analyzer with MySQL
Problem
ES reports access denied errors such as the following java.net.SocketPermission failure:
AccessControlException: access denied ("java.net.SocketPermission" "127.0.0.1:3306" "connect,resolve")
access denied java.security.AccessControlException: access denied
access denied java.lang.RuntimePermission "setContextClassLoader";
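Solution
In my own testing, adding the following lines inside the grant { ... } block of /usr/soft/es/elasticsearch-7.12.1/jdk/conf/security/java.policy
resolves the errors (this is not necessarily the best approach, since it grants these permissions to all code running on that JVM)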
permission java.net.SocketPermission "*", "connect,resolve";
permission java.lang.RuntimePermission "setContextClassLoader";
permission java.security.SecurityPermission "*";
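A narrower option may be worth trying. In the Dictionary code above, the stock HTTP download in getRemoteWords runs inside SpecialPermission.check() plus AccessController.doPrivileged, while the added loadMySQLExtDict opens its JDBC connection without such a wrapper. One plausible fix, which I have not verified against ES 7.12.1, is to wrap the JDBC call the same way and grant the permissions in the plugin's own policy file instead of the JVM-wide java.policy. The helper below is a plain-JDK sketch of that wrapper; `PrivilegedCalls` is a hypothetical name, and inside the plugin the call would be preceded by SpecialPermission.check() as in getRemoteWords.

```java
import java.security.AccessController;
import java.security.PrivilegedAction;
import java.util.concurrent.Callable;

// AccessController is deprecated for removal in recent JDKs; it is still what ES 7.x plugins use.
@SuppressWarnings("removal")
final class PrivilegedCalls {
    /**
     * Runs the task inside doPrivileged so that permission checks stop at this class's
     * protection domain instead of walking up to less privileged callers.
     * Checked exceptions from the task are rethrown as RuntimeException.
     */
    static <T> T call(Callable<T> task) {
        return AccessController.doPrivileged((PrivilegedAction<T>) () -> {
            try {
                return task.call();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
    }
}
```

A call site could then look like `PrivilegedCalls.call(() -> DriverManager.getConnection(url, user, password))`, with the three JDBC values read from jdbc-reload.properties as before.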