Elasticsearch Analyzers, the IK Analyzer, and IK Analyzer Permission Issues

Analyzer Basics

Analysis and Analyzers

Analysis: text analysis is the process of converting full text into a series of terms (tokens), also called tokenization.

Analysis is carried out by an Analyzer.

When a document is indexed, an inverted index may be created for each field (the mapping can be configured to skip indexing a field).

Building the inverted index means splitting the document into individual terms with an Analyzer; each term then points to the set of documents that contain it.

At query time, Elasticsearch decides, based on the search type, whether to analyze the query; the resulting terms are then matched against the inverted index to find the relevant documents.

Anatomy of an Analyzer

An analyzer is built from three kinds of building blocks: character filters, tokenizers, and token filters.

Character filters

Preprocess the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and expanding symbols (& --> and, so I&you --> I and you).

Tokenizers

English text can be tokenized by splitting on spaces; Chinese tokenization is much harder and may rely on machine-learning-based segmentation.

Token filters

Post-process the tokens produced by the tokenizer: lowercase them (e.g. convert "Quick" to "quick"), remove tokens (e.g. stopwords such as "a", "and", "the"), or add tokens (e.g. synonyms such as "jump" and "leap").

How the three fit together

Processing order: Character Filters ---> Tokenizer ---> Token Filters

Cardinality: analyzer = CharFilters (zero or more) + Tokenizer (exactly one) + TokenFilters (zero or more)
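The _analyze API lets you assemble these building blocks ad hoc, which is a convenient way to watch the pipeline work. A minimal sketch using only built-in components (the html_strip character filter, the standard tokenizer, and the lowercase token filter):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<span>Hello WORLD</span>"
}

This should return the tokens hello and world: the tags are stripped first, the tokenizer splits on the space, and the token filter lowercases.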

Elasticsearch's built-in analyzers

  • Standard Analyzer - the default; splits on word boundaries and lowercases
  • Simple Analyzer - splits on non-letter characters (symbols are discarded) and lowercases
  • Stop Analyzer - lowercases and removes stopwords (the, a, is, ...)
  • Whitespace Analyzer - splits on whitespace; does not lowercase
  • Keyword Analyzer - no tokenization; the whole input becomes a single token
  • Pattern Analyzer - splits on a regular expression, \W+ (non-word characters) by default
  • Language - analyzers for 30+ common languages
  • Custom Analyzer - user-defined analyzers

Setting an analyzer at index creation

#Define a custom std_folded analyzer in settings
#In mappings, the title field uses the custom analyzer; the content field uses the built-in whitespace analyzer
PUT new_index
{
 "settings": {
   "analysis": {
     "analyzer": {
       "std_folded":{
         "type":"custom",
         "tokenizer":"standard",
         "filter":["lowercase","asciifolding"]
       }
     }
   }
 },
 "mappings": {
   "properties": {
     "title":{
       "type": "text",
       "analyzer": "std_folded"
     },
     "content":{
       "type": "text",
       "analyzer": "whitespace"
     }
   }
 }
}
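To sanity-check the custom analyzer, you can run _analyze against the new index by field name; a quick sketch with made-up sample text:

POST new_index/_analyze
{
  "field": "title",
  "text": "Crème BRÛLÉE"
}

Since std_folded chains lowercase and asciifolding after the standard tokenizer, this should return the tokens creme and brulee.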

Commonly used built-in analyzers

Standard Analyzer (default)

standard is the default analyzer. It provides grammar-based tokenization (following the Unicode Text Segmentation algorithm) and works well for most languages.

POST _analyze
{
  "analyzer": "standard",
  "text":     "Like X 国庆放假的"
}
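For this input, the standard analyzer lowercases the Latin words and emits each CJK character as its own token, so the result is roughly like, x, 国, 庆, 放, 假, 的.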

Configuration

The standard analyzer accepts the following parameters:

  • max_token_length : the maximum token length; tokens longer than this are split at max_token_length intervals. Defaults to 255.
  • stopwords : a pre-defined stopword list such as _english_, or an array containing a stopword list. Defaults to _none_.
  • stopwords_path : the path to a file containing stopwords.

#Define my_english_analyzer based on the standard analyzer
#max_token_length 5: split tokens longer than 5 characters
#stopwords _english_: filter English stopwords
PUT new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
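Testing it with the sample sentence from the Elasticsearch reference shows both settings at work:

POST new_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

This should produce [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]: the stopword "the" is dropped, and "jumped" is split into "jumpe" and "d" by the 5-character limit.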

Simple Analyzer

The simple analyzer breaks text into terms whenever it meets a character that is not a letter, and lowercases every term.

POST _analyze
{
  "analyzer": "simple",
  "text":     "Like X 国庆放假 的"
}
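CJK characters count as letters here, so the space-separated Chinese stays intact: the tokens are roughly like, x, 国庆放假, 的.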


Whitespace Analyzer

Splits on whitespace and does not lowercase.

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "Like X 国庆放假 的"
}
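This time the tokens are roughly Like, X, 国庆放假, 的, with the original case preserved.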

Chinese analysis: the IK analyzer

Installing the IK analyzer

  1. The open-source IK analyzer on GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Note: the IK version must match the version of ES you installed. Mine is 7.12.1, so find the matching release on GitHub.

Because we want a custom hot-word dictionary, download the source; if you don't need customization, you can use a release build directly (see the install command sketch below).
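As an aside, a release build can usually be installed in one step with the plugin tool; the URL pattern below follows the IK README, but check it against your exact ES version:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.12.1/elasticsearch-analysis-ik-7.12.1.zip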

  2. Build it in your IDE, then unzip the archive from target/releases in the build output into the plugins directory of your ES installation (you can create an ik folder there).


  3. Restart ES.

    Note: after installing the plugin, ES must be restarted for it to take effect.

Using IK

IK offers two segmentation granularities:

ik_smart: the coarsest-grained segmentation

ik_max_word: the finest-grained segmentation

ik_smart

GET /_analyze
{
  "text":"中华人民共和国国徽",
  "analyzer":"ik_smart"
}
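ik_smart prefers the longest dictionary match, so the result is roughly two tokens: 中华人民共和国 and 国徽.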

ik_max_word

GET /_analyze
{
  "text":"中华人民共和国国徽",
  "analyzer":"ik_max_word"
}
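ik_max_word enumerates every dictionary word it can find in the text, so the result is roughly 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国, 国徽 (the exact list can vary slightly by IK version).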

Using MySQL as IK's hot-word dictionary (IK and ES both 7.12.1)

Modifying the source

  1. Add a new HotDictReloadThread class under org/wltea/analyzer/dic
package org.wltea.analyzer.dic;

import org.apache.logging.log4j.Logger;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

public class HotDictReloadThread {
    private static final Logger log = ESPluginLoggerFactory.getLogger(HotDictReloadThread.class.getName());

    public void initial(){
        // Reload the dictionaries in an endless loop. This is not a busy spin:
        // reLoadMainDict() ends up calling loadMySQLExtDict(), which sleeps
        // for jdbc.reload.interval seconds on every pass.
        while (true) {
            log.info("HotDictReloadThread: reloading IK dictionaries...");
            Dictionary.getSingleton().reLoadMainDict();
        }
    }
}

  2. Modify Dictionary in org/wltea/analyzer/dic as follows (the full file is shown; the additions are loadMySQLExtDict, loadMySQLStopwordDict, and the HotDictReloadThread startup inside initial)
/**
 * IK Chinese word segmentation, version 5.0
 * IK Analyzer release 5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 *
 *
 */
package org.wltea.analyzer.dic;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.Files;
import java.nio.file.FileVisitResult;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.security.AccessController;
import java.security.PrivilegedAction;
import java.sql.*;
import java.util.*;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.elasticsearch.SpecialPermission;
import org.elasticsearch.common.io.PathUtils;
import org.elasticsearch.plugin.analysis.ik.AnalysisIkPlugin;
import org.wltea.analyzer.cfg.Configuration;
import org.apache.logging.log4j.Logger;
import org.wltea.analyzer.help.ESPluginLoggerFactory;


/**
 * Dictionary manager class (singleton)
 */
public class Dictionary {

	/*
	 * Singleton instance of the dictionary
	 */
	private static Dictionary singleton;

	private DictSegment _MainDict;

	private DictSegment _QuantifierDict;

	private DictSegment _StopWords;

	/**
	 * Configuration object
	 */
	private Configuration configuration;

	private static final Logger logger = ESPluginLoggerFactory.getLogger(Dictionary.class.getName());

	private static ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);

	private static final String PATH_DIC_MAIN = "main.dic";
	private static final String PATH_DIC_SURNAME = "surname.dic";
	private static final String PATH_DIC_QUANTIFIER = "quantifier.dic";
	private static final String PATH_DIC_SUFFIX = "suffix.dic";
	private static final String PATH_DIC_PREP = "preposition.dic";
	private static final String PATH_DIC_STOP = "stopword.dic";

	private final static  String FILE_NAME = "IKAnalyzer.cfg.xml";
	private final static  String EXT_DICT = "ext_dict";
	private final static  String REMOTE_EXT_DICT = "remote_ext_dict";
	private final static  String EXT_STOP = "ext_stopwords";
	private final static  String REMOTE_EXT_STOP = "remote_ext_stopwords";

	private Path conf_dir;
	private Properties props;

	private Dictionary(Configuration cfg) {
		this.configuration = cfg;
		this.props = new Properties();
		this.conf_dir = cfg.getEnvironment().configFile().resolve(AnalysisIkPlugin.PLUGIN_NAME);
		Path configFile = conf_dir.resolve(FILE_NAME);

		InputStream input = null;
		try {
			logger.info("try load config from {}", configFile);
			input = new FileInputStream(configFile.toFile());
		} catch (FileNotFoundException e) {
			conf_dir = cfg.getConfigInPluginDir();
			configFile = conf_dir.resolve(FILE_NAME);
			try {
				logger.info("try load config from {}", configFile);
				input = new FileInputStream(configFile.toFile());
			} catch (FileNotFoundException ex) {
				// We should report origin exception
				logger.error("ik-analyzer", e);
			}
		}
		if (input != null) {
			try {
				props.loadFromXML(input);
			} catch (IOException e) {
				logger.error("ik-analyzer", e);
			}
		}
	}

	private String getProperty(String key){
		if(props!=null){
			return props.getProperty(key);
		}
		return null;
	}
	/**
	 * Dictionary initialization. IK Analyzer loads its dictionaries lazily via
	 * static methods on the Dictionary class, which would delay the first
	 * analysis request; this method lets the application initialize the
	 * dictionaries at startup instead.
	 */
	public static synchronized void initial(Configuration cfg) {
		if (singleton == null) {
			synchronized (Dictionary.class) {
				if (singleton == null) {

					singleton = new Dictionary(cfg);
					singleton.loadMainDict();
					singleton.loadSurnameDict();
					singleton.loadQuantifierDict();
					singleton.loadSuffixDict();
					singleton.loadPrepDict();
					singleton.loadStopWordDict();
					//After the dictionary singleton is initialized, start a thread to hot-reload the dictionaries
					pool.execute(() -> new HotDictReloadThread().initial());

					if(cfg.isEnableRemoteDict()){
						// Start monitor threads for the remote dictionaries
						for (String location : singleton.getRemoteExtDictionarys()) {
							// 10 is the initial delay and 60 the period, both in seconds; adjust as needed
							pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
						}
						for (String location : singleton.getRemoteExtStopWordDictionarys()) {
							pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
						}
					}

				}
			}
		}
	}

	private void walkFileTree(List<String> files, Path path) {
		if (Files.isRegularFile(path)) {
			files.add(path.toString());
		} else if (Files.isDirectory(path)) try {
			Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
				@Override
				public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
					files.add(file.toString());
					return FileVisitResult.CONTINUE;
				}
				@Override
				public FileVisitResult visitFileFailed(Path file, IOException e) {
					logger.error("[Ext Loading] listing files", e);
					return FileVisitResult.CONTINUE;
				}
			});
		} catch (IOException e) {
			logger.error("[Ext Loading] listing files", e);
		} else {
			logger.warn("[Ext Loading] file not found: " + path);
		}
	}

	private void loadDictFile(DictSegment dict, Path file, boolean critical, String name) {
		try (InputStream is = new FileInputStream(file.toFile())) {
			BufferedReader br = new BufferedReader(
					new InputStreamReader(is, "UTF-8"), 512);
			String word = br.readLine();
			if (word != null) {
				if (word.startsWith("\uFEFF"))
					word = word.substring(1);
				for (; word != null; word = br.readLine()) {
					word = word.trim();
					if (word.isEmpty()) continue;
					dict.fillSegment(word.toCharArray());
				}
			}
		} catch (FileNotFoundException e) {
			logger.error("ik-analyzer: " + name + " not found", e);
			if (critical) throw new RuntimeException("ik-analyzer: " + name + " not found!!!", e);
		} catch (IOException e) {
			logger.error("ik-analyzer: " + name + " loading failed", e);
		}
	}

	private List<String> getExtDictionarys() {
		List<String> extDictFiles = new ArrayList<String>(2);
		String extDictCfg = getProperty(EXT_DICT);
		if (extDictCfg != null) {

			String[] filePaths = extDictCfg.split(";");
			for (String filePath : filePaths) {
				if (filePath != null && !"".equals(filePath.trim())) {
					Path file = PathUtils.get(getDictRoot(), filePath.trim());
					walkFileTree(extDictFiles, file);

				}
			}
		}
		return extDictFiles;
	}

	private List<String> getRemoteExtDictionarys() {
		List<String> remoteExtDictFiles = new ArrayList<String>(2);
		String remoteExtDictCfg = getProperty(REMOTE_EXT_DICT);
		if (remoteExtDictCfg != null) {

			String[] filePaths = remoteExtDictCfg.split(";");
			for (String filePath : filePaths) {
				if (filePath != null && !"".equals(filePath.trim())) {
					remoteExtDictFiles.add(filePath);

				}
			}
		}
		return remoteExtDictFiles;
	}

	private List<String> getExtStopWordDictionarys() {
		List<String> extStopWordDictFiles = new ArrayList<String>(2);
		String extStopWordDictCfg = getProperty(EXT_STOP);
		if (extStopWordDictCfg != null) {

			String[] filePaths = extStopWordDictCfg.split(";");
			for (String filePath : filePaths) {
				if (filePath != null && !"".equals(filePath.trim())) {
					Path file = PathUtils.get(getDictRoot(), filePath.trim());
					walkFileTree(extStopWordDictFiles, file);

				}
			}
		}
		return extStopWordDictFiles;
	}

	private List<String> getRemoteExtStopWordDictionarys() {
		List<String> remoteExtStopWordDictFiles = new ArrayList<String>(2);
		String remoteExtStopWordDictCfg = getProperty(REMOTE_EXT_STOP);
		if (remoteExtStopWordDictCfg != null) {

			String[] filePaths = remoteExtStopWordDictCfg.split(";");
			for (String filePath : filePaths) {
				if (filePath != null && !"".equals(filePath.trim())) {
					remoteExtStopWordDictFiles.add(filePath);

				}
			}
		}
		return remoteExtStopWordDictFiles;
	}

	private String getDictRoot() {
		return conf_dir.toAbsolutePath().toString();
	}


	/**
	 * Get the dictionary singleton instance
	 * 
	 * @return Dictionary singleton
	 */
	public static Dictionary getSingleton() {
		if (singleton == null) {
			throw new IllegalStateException("ik dict has not been initialized yet, please call initial method first.");
		}
		return singleton;
	}


	/**
	 * Load new words in batch
	 * 
	 * @param words
	 *            Collection<String> of words to add
	 */
	public void addWords(Collection<String> words) {
		if (words != null) {
			for (String word : words) {
				if (word != null) {
					// Load the words into the in-memory main dictionary
					singleton._MainDict.fillSegment(word.trim().toCharArray());
				}
			}
		}
	}

	/**
	 * Disable (mask) words in batch
	 */
	public void disableWords(Collection<String> words) {
		if (words != null) {
			for (String word : words) {
				if (word != null) {
					// Mask the words
					singleton._MainDict.disableSegment(word.trim().toCharArray());
				}
			}
		}
	}

	/**
	 * Match against the main dictionary
	 * 
	 * @return Hit describing the match result
	 */
	public Hit matchInMainDict(char[] charArray) {
		return singleton._MainDict.match(charArray);
	}

	/**
	 * Match against the main dictionary
	 * 
	 * @return Hit describing the match result
	 */
	public Hit matchInMainDict(char[] charArray, int begin, int length) {
		return singleton._MainDict.match(charArray, begin, length);
	}

	/**
	 * Match against the quantifier dictionary
	 * 
	 * @return Hit describing the match result
	 */
	public Hit matchInQuantifierDict(char[] charArray, int begin, int length) {
		return singleton._QuantifierDict.match(charArray, begin, length);
	}

	/**
	 * Take the DictSegment from a matched Hit and continue matching from it
	 * 
	 * @return Hit
	 */
	public Hit matchWithHit(char[] charArray, int currentIndex, Hit matchedHit) {
		DictSegment ds = matchedHit.getMatchedDictSegment();
		return ds.match(charArray, currentIndex, 1, matchedHit);
	}

	/**
	 * Check whether the given characters form a stopword
	 * 
	 * @return boolean
	 */
	public boolean isStopWord(char[] charArray, int begin, int length) {
		return singleton._StopWords.match(charArray, begin, length).isMatch();
	}

	/**
	 * Load the main dictionary and its extension dictionaries
	 */
	private void loadMainDict() {
		// Create the main dictionary instance
		_MainDict = new DictSegment((char) 0);

		// Read the main dictionary file
		Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);
		loadDictFile(_MainDict, file, false, "Main Dict");
		// Load extension dictionaries
		this.loadExtDict();
		// Load remote custom dictionaries
		this.loadRemoteExtDict();
		// Load the MySQL hot-word dictionary (new)
		this.loadMySQLExtDict();
	}

	/**
	 * Load the user-configured extension dictionaries into the main dictionary
	 */
	private void loadExtDict() {
		// Read the extension dictionary configuration
		List<String> extDictFiles = getExtDictionarys();
		if (extDictFiles != null) {
			for (String extDictName : extDictFiles) {
				// Read each extension dictionary file
				logger.info("[Dict Loading] " + extDictName);
				Path file = PathUtils.get(extDictName);
				loadDictFile(_MainDict, file, false, "Extra Dict");
			}
		}
	}

	/**
	 * Load remote extension dictionaries into the main dictionary
	 */
	private void loadRemoteExtDict() {
		List<String> remoteExtDictFiles = getRemoteExtDictionarys();
		for (String location : remoteExtDictFiles) {
			logger.info("[Dict Loading] " + location);
			List<String> lists = getRemoteWords(location);
			// Ignore a remote dictionary that cannot be fetched
			if (lists == null) {
				logger.error("[Dict Loading] " + location + " load failed");
				continue;
			}
			for (String theWord : lists) {
				if (theWord != null && !"".equals(theWord.trim())) {
					// Load the remote entries into the in-memory main dictionary
					logger.info(theWord);
					_MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
				}
			}
		}

	}


	/**
	 * Load the dynamic hot-word dictionary from MySQL (new)
	 */
	private void loadMySQLExtDict() {
		Connection conn = null;
		Statement stmt = null;
		ResultSet rs = null;
		try {
			Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
			props.load(new FileInputStream(file.toFile()));

			logger.info("[==========]jdbc-reload.properties");
			for(Object key : props.keySet()) {
				logger.info("[==========]" + key + "=" + props.getProperty(String.valueOf(key)));
			}

			logger.info("[==========]query hot dict from mysql, " + props.getProperty("jdbc.reload.sql") + "......");
       Class.forName(props.getProperty("jdbc.className"));
			conn = DriverManager.getConnection(
					props.getProperty("jdbc.url"),
					props.getProperty("jdbc.user"),
					props.getProperty("jdbc.password"));
			stmt = conn.createStatement();
			rs = stmt.executeQuery(props.getProperty("jdbc.reload.sql"));

			while(rs.next()) {
				String theWord = rs.getString("word");
				logger.info("[==========]正在加载Mysql自定义IK扩展词库词条: " + theWord);
				_MainDict.fillSegment(theWord.trim().toCharArray());
			}

			// This sleep paces HotDictReloadThread's reload loop
			Thread.sleep(Integer.parseInt(String.valueOf(props.get("jdbc.reload.interval"))) * 1000L);
		} catch (Exception e) {
			logger.error("erorr", e);
		} finally {
			if(rs != null) {
				try {
					rs.close();
				} catch (SQLException e) {
					logger.error("error", e);
				}
			}
			if(stmt != null) {
				try {
					stmt.close();
				} catch (SQLException e) {
					logger.error("error", e);
				}
			}
			if(conn != null) {
				try {
					conn.close();
				} catch (SQLException e) {
					logger.error("error", e);
				}
			}
		}
	}


	/**
	 * Load the stopword dictionary from MySQL (new)
	 */
	private void loadMySQLStopwordDict() {
		Connection conn = null;
		Statement stmt = null;
		ResultSet rs = null;

		try {
			Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
			props.load(new FileInputStream(file.toFile()));

			logger.info("[==========]jdbc-reload.properties");
			for(Object key : props.keySet()) {
				logger.info("[==========]" + key + "=" + props.getProperty(String.valueOf(key)));
			}
			logger.info("[==========]query hot stopword dict from mysql, " + props.getProperty("jdbc.reload.stopword.sql") + "......");
       Class.forName(props.getProperty("jdbc.className"));
			conn = DriverManager.getConnection(
					props.getProperty("jdbc.url"),
					props.getProperty("jdbc.user"),
					props.getProperty("jdbc.password"));
			stmt = conn.createStatement();
			rs = stmt.executeQuery(props.getProperty("jdbc.reload.stopword.sql"));

			while(rs.next()) {
				String theWord = rs.getString("word");
				logger.info("[==========]正在加载Mysql自定义IK停用词库词条: " + theWord);
				_StopWords.fillSegment(theWord.trim().toCharArray());
			}
			Thread.sleep(Integer.parseInt(String.valueOf(props.get("jdbc.reload.interval"))) * 1000L);
		} catch (Exception e) {
			logger.error("erorr", e);
		} finally {
			if(rs != null) {
				try {
					rs.close();
				} catch (SQLException e) {
					logger.error("error", e);
				}
			}
			if(stmt != null) {
				try {
					stmt.close();
				} catch (SQLException e) {
					logger.error("error", e);
				}
			}
			if(conn != null) {
				try {
					conn.close();
				} catch (SQLException e) {
					logger.error("error", e);
				}
			}
		}
	}
	private static List<String> getRemoteWords(String location) {
		SpecialPermission.check();
		return AccessController.doPrivileged((PrivilegedAction<List<String>>) () -> {
			return getRemoteWordsUnprivileged(location);
		});
	}

	/**
	 * Download custom entries from a remote server
	 */
	private static List<String> getRemoteWordsUnprivileged(String location) {

		List<String> buffer = new ArrayList<String>();
		RequestConfig rc = RequestConfig.custom().setConnectionRequestTimeout(10 * 1000).setConnectTimeout(10 * 1000)
				.setSocketTimeout(60 * 1000).build();
		CloseableHttpClient httpclient = HttpClients.createDefault();
		CloseableHttpResponse response;
		BufferedReader in;
		HttpGet get = new HttpGet(location);
		get.setConfig(rc);
		try {
			response = httpclient.execute(get);
			if (response.getStatusLine().getStatusCode() == 200) {

				String charset = "UTF-8";
				// 获取编码,默认为utf-8
				HttpEntity entity = response.getEntity();
				if(entity!=null){
					Header contentType = entity.getContentType();
					if(contentType!=null&&contentType.getValue()!=null){
						String typeValue = contentType.getValue();
						if(typeValue!=null&&typeValue.contains("charset=")){
							charset = typeValue.substring(typeValue.lastIndexOf("=") + 1);
						}
					}

					if (entity.getContentLength() > 0 || entity.isChunked()) {
						in = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
						String line;
						while ((line = in.readLine()) != null) {
							buffer.add(line);
						}
						in.close();
						response.close();
						return buffer;
					}
				}
			}
			response.close();
		} catch (IllegalStateException | IOException e) {
			logger.error("getRemoteWords {} error", e, location);
		}
		return buffer;
	}

	/**
	 * Load the user-extended stopword dictionaries
	 */
	private void loadStopWordDict() {
		// Create the stopword dictionary instance
		_StopWords = new DictSegment((char) 0);

		// Read the main stopword file
		Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_STOP);
		loadDictFile(_StopWords, file, false, "Main Stopwords");

		// Load extension stopword dictionaries
		List<String> extStopWordDictFiles = getExtStopWordDictionarys();
		if (extStopWordDictFiles != null) {
			for (String extStopWordDictName : extStopWordDictFiles) {
				logger.info("[Dict Loading] " + extStopWordDictName);

				// Read each extension stopword file
				file = PathUtils.get(extStopWordDictName);
				loadDictFile(_StopWords, file, false, "Extra Stopwords");
			}
		}

		// Load remote stopword dictionaries
		List<String> remoteExtStopWordDictFiles = getRemoteExtStopWordDictionarys();
		for (String location : remoteExtStopWordDictFiles) {
			logger.info("[Dict Loading] " + location);
			List<String> lists = getRemoteWords(location);
			// Ignore a remote dictionary that cannot be fetched
			if (lists == null) {
				logger.error("[Dict Loading] " + location + " load failed");
				continue;
			}
			for (String theWord : lists) {
				if (theWord != null && !"".equals(theWord.trim())) {
					// Load the remote entries into memory
					logger.info(theWord);
					_StopWords.fillSegment(theWord.trim().toLowerCase().toCharArray());
				}
			}
		}

		// Also load stopwords from MySQL (new); without this call,
		// loadMySQLStopwordDict() above would never run
		this.loadMySQLStopwordDict();
	}

	/**
	 * Load the quantifier dictionary
	 */
	private void loadQuantifierDict() {
		// Create the quantifier dictionary instance
		_QuantifierDict = new DictSegment((char) 0);
		// Read the quantifier dictionary file
		Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_QUANTIFIER);
		loadDictFile(_QuantifierDict, file, false, "Quantifier");
	}

	private void loadSurnameDict() {
		DictSegment _SurnameDict = new DictSegment((char) 0);
		Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_SURNAME);
		loadDictFile(_SurnameDict, file, true, "Surname");
	}

	private void loadSuffixDict() {
		DictSegment _SuffixDict = new DictSegment((char) 0);
		Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_SUFFIX);
		loadDictFile(_SuffixDict, file, true, "Suffix");
	}

	private void loadPrepDict() {
		DictSegment _PrepDict = new DictSegment((char) 0);
		Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_PREP);
		loadDictFile(_PrepDict, file, true, "Preposition");
	}

	void reLoadMainDict() {
		logger.info("start to reload ik dict.");
		// Load into a fresh instance so reloading does not disturb the dictionaries currently in use
		Dictionary tmpDict = new Dictionary(configuration);
		tmpDict.configuration = getSingleton().configuration;
		tmpDict.loadMainDict();
		tmpDict.loadStopWordDict();
		_MainDict = tmpDict._MainDict;
		_StopWords = tmpDict._StopWords;
		logger.info("reload ik dict finished.");
	}

}


  3. Add a database configuration file jdbc-reload.properties in the plugin's config directory (the directory resolved by getDictRoot() above, i.e. where IKAnalyzer.cfg.xml lives)

    # Database URL
    jdbc.url=jdbc:mysql://192.168.232.128:3306/ik_test?serverTimezone=GMT&autoReconnect=true&useUnicode=true&characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&useAffectedRows=true&useSSL=false
    # Database user
    jdbc.user=root
    # Database password
    jdbc.password=123456
    # JDBC driver class; required by the Class.forName call above
    # (com.mysql.cj.jdbc.Driver is the Connector/J 8.x class name)
    jdbc.className=com.mysql.cj.jdbc.Driver
    # SQL that loads the extension dictionary
    jdbc.reload.sql=select gel.lexicon_text as word from es_lexicon gel where gel.lexicon_type = 0 and gel.lexicon_status = 0 and gel.del_flag = 0 order by gel.lexicon_id desc
    # SQL that loads the stopword dictionary (the table name must match the DDL below)
    jdbc.reload.stopword.sql=select gel.lexicon_text as word from es_lexicon gel where gel.lexicon_type = 1 and gel.lexicon_status = 0 and gel.del_flag = 0 order by gel.lexicon_id desc
    # Reload interval in seconds (query every 10 seconds)
    jdbc.reload.interval=10
    
  4. Table DDL

    SET NAMES utf8mb4;
    SET FOREIGN_KEY_CHECKS = 0;
    
    -- ----------------------------
    -- Table structure for es_lexicon
    -- ----------------------------
    DROP TABLE IF EXISTS `es_lexicon`;
    CREATE TABLE `es_lexicon`  (
      `lexicon_id` bigint(8) NOT NULL AUTO_INCREMENT COMMENT 'entry id',
      `lexicon_text` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'the word itself',
      `lexicon_type` int(1) NOT NULL DEFAULT 0 COMMENT '0 = extension word, 1 = stopword',
      `lexicon_status` int(1) NOT NULL DEFAULT 0 COMMENT 'status: 0 = active, 1 = suspended',
      `del_flag` int(1) NOT NULL DEFAULT 0 COMMENT 'deletion flag: 0 = normal, 1 = deleted',
      `create_time` datetime(0) NOT NULL DEFAULT CURRENT_TIMESTAMP(0) COMMENT 'creation time',
      PRIMARY KEY (`lexicon_id`) USING BTREE
    ) ENGINE = InnoDB AUTO_INCREMENT = 2 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci COMMENT = 'ES remote extension dictionary' ROW_FORMAT = DYNAMIC;
    
    -- ----------------------------
    -- Records of es_lexicon
    -- ----------------------------
    -- del_flag must be 0, otherwise jdbc.reload.sql will skip this row
    INSERT INTO `es_lexicon` VALUES (1, '大脸猫', 0, 0, 0, '2021-05-26 22:33:40');
    
    SET FOREIGN_KEY_CHECKS = 1;
    

Build the plugin by running mvn install.

Unzip target/releases/elasticsearch-analysis-ik-7.12.1.zip into (ES install path)/plugins/ik.

Copy the matching MySQL JDBC driver jar into (ES install path)/plugins/ik as well.

Restart ES.

Result
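After the restart, the ES log should print each word as it is loaded from MySQL (the "loading custom IK extension word from MySQL" lines from loadMySQLExtDict above). From then on, adding a hot word is just an INSERT; IK picks it up on the next reload pass, i.e. within about jdbc.reload.interval seconds. A sketch with an example word:

INSERT INTO es_lexicon (lexicon_text, lexicon_type, lexicon_status, del_flag)
VALUES ('哔哩哔哩', 0, 0, 0);

Then verify with _analyze; with the sample row from the DDL above loaded, 大脸猫 should come back as a single token:

GET /_analyze
{
  "text":"大脸猫是一个自定义词",
  "analyzer":"ik_max_word"
}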


Permission problems when the IK analyzer loads from MySQL

Symptoms

Errors like the following appear:

AccessControlException: access denied ("java.net.SocketPermission" "127.0.0.1:3306" "connect,resolve")

java.security.AccessControlException: access denied

access denied ("java.lang.RuntimePermission" "setContextClassLoader")

Solution

In my own experience, adding the following permissions inside the grant { ... } block of /usr/soft/es/elasticsearch-7.12.1/jdk/conf/security/java.policy solves it (not necessarily the best approach). The policy file is only read at JVM startup, so restart ES afterwards.

	permission java.net.SocketPermission "*", "connect,resolve";
	permission java.lang.RuntimePermission  "setContextClassLoader";
	permission java.security.SecurityPermission "*";


posted @ 2021-06-08 16:00  知白守黑,和光同尘