Big Data Fundamentals: ElasticSearch (4) The ES Data Import Process
1 Prepare an analyzer
Built-in analyzers
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
Chinese word segmentation
smartcn
Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html
ik
$ bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip
Reference: https://github.com/medcl/elasticsearch-analysis-ik
Other plugins
Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html
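A quick way to compare analyzers is the _analyze API. A minimal sketch that only builds the request bodies and prints ready-to-run curl commands (it assumes a node at localhost:9200 with the ik plugin installed; the actual HTTP call is left to curl):

```python
import json

# Build the request body for the _analyze API, which shows how a given
# analyzer tokenizes a piece of text.
def analyze_request(analyzer, text):
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)

# Compare the standard analyzer (splits CJK text into single characters)
# with ik_smart (word-level segmentation).
for analyzer in ("standard", "ik_smart"):
    body = analyze_request(analyzer, "中华人民共和国")
    print(f"curl -XPOST -H 'Content-Type: application/json' "
          f"'http://localhost:9200/_analyze' -d '{body}'")
```

Running the printed commands against a live node makes the difference between the analyzers visible before you commit to one in the mapping.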
2 Create the index: prepare the mapping, determine shards and replicas
# curl -XPUT -H 'Content-Type: application/json' http://localhost:9200/testdoc -d '
{
  "settings": {
    "index.number_of_shards": 10,
    "index.number_of_routing_shards": 30,
    "index.number_of_replicas": 1,
    "index.translog.durability": "async",
    "index.merge.scheduler.max_thread_count": 1,
    "index.refresh_interval": "30s"
  },
  "mappings": {
    "_doc": {
      "_all": { "enabled": false },
      "_source": { "enabled": false },
      "properties": {
        "title": { "type": "text", "analyzer": "ik_smart" },
        "name": { "type": "keyword", "doc_values": false },
        "age": { "type": "integer", "index": false },
        "created": { "type": "date", "format": "strict_date_optional_time||epoch_millis" }
      }
    }
  }
}'
Where:
_source controls whether the original JSON document is stored
_all controls whether an inverted index is built over the whole document
analyzer specifies the tokenizer used for a text field
doc_values controls whether the field gets columnar (doc values) storage
index controls whether the field gets an inverted index
The _source field stores the original JSON body of the document. If you don’t need access to it you can disable it.
By default Elasticsearch indexes and adds doc values to most fields so that they can be searched and aggregated out of the box.
Data types
参考:https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
There are two string types, text and keyword; the difference is that text is analyzed (split into tokens) while keyword is indexed as a single term.
text
参考:https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html
keyword
参考:https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html
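When the mapping grows, it is often easier to build the index definition in code than to maintain a one-line curl body. A minimal sketch that rebuilds the same settings and mappings as the curl example above (it only produces the request body; sending it to the cluster is still left to curl or a client library):

```python
import json

# Rebuild the testdoc index definition from the curl example as a dict:
# _source and _all are disabled, title is analyzed with ik_smart, name is
# an exact-match keyword, and age is stored but not indexed.
def testdoc_index_body():
    return {
        "settings": {
            "index.number_of_shards": 10,
            "index.number_of_routing_shards": 30,
            "index.number_of_replicas": 1,
            "index.translog.durability": "async",
            "index.merge.scheduler.max_thread_count": 1,
            "index.refresh_interval": "30s",
        },
        "mappings": {
            "_doc": {
                "_all": {"enabled": False},
                "_source": {"enabled": False},
                "properties": {
                    "title": {"type": "text", "analyzer": "ik_smart"},
                    "name": {"type": "keyword", "doc_values": False},
                    "age": {"type": "integer", "index": False},
                    "created": {
                        "type": "date",
                        "format": "strict_date_optional_time||epoch_millis",
                    },
                },
            }
        },
    }

# Print the body so it can be piped into the curl -XPUT call above.
print(json.dumps(testdoc_index_body(), indent=2))
```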
3 Import the data
3.1 Call the index API
参考:https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
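The index API writes one document per request; bulk imports (including the es-hadoop path whose stack traces appear later in this post) go through the _bulk API, whose body is newline-delimited JSON: one action line followed by one document line per record. A minimal sketch that assembles such a body (the sample fields follow the testdoc mapping above; posting it is left to curl):

```python
import json

# Assemble an NDJSON body for the _bulk API: for each document, an
# action line ({"index": ...}) followed by the document source line.
# The _bulk API requires the body to end with a newline.
def bulk_body(index, doc_type, docs):
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"

body = bulk_body("testdoc", "_doc", [
    {"title": "大数据基础", "name": "barney", "age": 30, "created": 1553655290091},
])
print(body)
# POST it with: curl -XPOST -H 'Content-Type: application/x-ndjson' \
#   'http://localhost:9200/_bulk' --data-binary @body.json
```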
3.2 Prepare a Hive external table
Details: https://www.cnblogs.com/barneywill/p/10300951.html
4 Test
# curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/_xpack/sql?format=txt' -d '{"query":"select * from testdoc limit 10"}'
or
# curl -XGET 'http://localhost:9200/testdoc/_search?q=*'
5 Problems
Error: all nodes failed
2019-03-27 03:14:50,091 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.1:9200] failed (Read timed out); selected next node [192.168.0.1:9200]
2019-03-27 03:15:50,148 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.2:9200] failed (Read timed out); selected next node [192.168.0.2:9200]
2019-03-27 03:16:50,207 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.3:9200] failed (Read timed out); no other nodes left - aborting...
2019-03-27 03:16:50,208 ERROR [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: Hit error while closing operators - failing tree
2019-03-27 03:16:50,210 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Hive Runtime Error while closing operators
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.0.1:9200, 192.168.0.2:9200, 192.168.0.3:9200]]
    at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:152)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:398)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:362)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:366)
    at org.elasticsearch.hadoop.rest.RestClient.refresh(RestClient.java:267)
    at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.close(BulkProcessor.java:550)
    at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:219)
    at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
    at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.close(EsHiveOutputFormat.java:74)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:190)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1047)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:189)
    ... 8 more
Fix: increase index.number_of_shards so writes are spread across more shards; it can only be set when the index is created, and the default is 5.
Error: es_rejected_execution_exception
Caused by: org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [70/1000]. Error sample (first [5] error messages):
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: es_rejected_execution_exception: rejected execution of processing of [7622922][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[test_indix][18]] containing [38] requests, target allocation id: iLlIBScJTxahse559pTINQ, primary term: 1 on EsThreadPoolExecutor[name = 1hxgYU_/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@ce11763[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 5686436]]
Cause:
thread_pool.write.queue_size
For single-document index/delete/update and bulk requests. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.
The queue_size allows to control the size of the queue of pending requests that have no threads to execute them. By default, it is set to -1 which means its unbounded. When a request comes in and the queue is full, it will abort the request.
Check the thread_pool statistics:
# curl 'http://localhost:9200/_nodes/stats?pretty'|grep '"write"' -A 7
This usually happens when the write rate, concurrency, or overall load exceeds what ES can keep up with; once the queue is full, any further requests are rejected.
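The grep above pulls the "write" block out of the node-stats response; the same check can be done programmatically. A sketch that extracts the per-node write-pool rejection counter from a parsed _nodes/stats response (fetching the JSON is left to curl; the sample below is trimmed to the relevant shape, with made-up numbers):

```python
# Extract the write thread-pool "rejected" counter for every node from a
# parsed _nodes/stats response. A steadily growing rejected count is the
# signature of es_rejected_execution_exception.
def write_pool_rejections(stats):
    result = {}
    for node_id, node in stats.get("nodes", {}).items():
        pool = node.get("thread_pool", {}).get("write", {})
        result[node.get("name", node_id)] = pool.get("rejected", 0)
    return result

# Example with a trimmed response (shape only; numbers are illustrative):
sample = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "thread_pool": {"write": {"threads": 32, "queue": 200,
                                      "active": 32, "rejected": 930,
                                      "completed": 5686436}},
        }
    }
}
print(write_pool_rejections(sample))  # {'node-1': 930}
```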
Fixes:
1) Tune the configuration
index.refresh_interval: -1
index.number_of_replicas: 0
indices.memory.index_buffer_size: 40%
thread_pool.write.queue_size: 1024
Details: https://www.cnblogs.com/barneywill/p/10615249.html
2) Reduce the write pressure
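On the client side, reducing write pressure usually means smaller bulk batches plus retrying with backoff when the cluster rejects a request. A generic sketch (send_batch is a placeholder for whatever client call performs the bulk write; it is assumed to raise on rejection and return normally on success):

```python
import time

# Retry a bulk write with exponential backoff when the cluster rejects it
# (es_rejected_execution_exception / HTTP 429). The sleep function is
# injectable so the backoff can be skipped in tests.
def write_with_backoff(send_batch, batch, retries=5, base_delay=1.0,
                       sleep=time.sleep):
    for attempt in range(retries):
        try:
            return send_batch(batch)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Example: a fake sender that is rejected twice before accepting the batch.
calls = {"n": 0}
def flaky_sender(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("es_rejected_execution_exception")
    return "ok"

print(write_with_backoff(flaky_sender, ["doc1"], sleep=lambda s: None))  # ok
```

Backing off gives the write queue time to drain, which is usually more robust than only enlarging thread_pool.write.queue_size on the server side.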