elasticsearch-hadoop: customizing the official package to support update/upsert with a doc

Official source: https://github.com/elastic/elasticsearch-hadoop

commit elasticsearch update doc by cclient · Pull Request #1080 · elastic/elasticsearch-hadoop (github.com)

Spark to ES

Four write operations:
index
update
upsert
create

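Only these four operations are supported. Going by the documentation, my current requirement can only be met with upsert, but the official package's upsert support is incomplete.

It only supports changes made through es.update.script, not the more general upsert + doc form. Using upsert with a doc also involves some field filtering. Take insert_date and update_date as an example: on an upsert, update_date should be refreshed while insert_date stays unchanged, so insert_date has to be filtered out of the update.

Upsert currently supports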
/*
* {
* "script":{
* "inline": "...",
* "lang": "...",
* "params": ...,
* },
* "upsert": {...}
* }
*/

and
/*
* {
* "doc_as_upsert": true,
* "doc": {...}
* }
*/

but does not support
/*
* {
* "upsert": {},
* "doc": {...}
* }
*/
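
To make the missing form concrete, here is a minimal sketch of the bulk update payload it corresponds to, using the insert_date / update_date example from above. The object name UpsertDocPayload, the helper bulkUpdateLines, and the sample values are made up for illustration; only the upsert / doc body shape comes from the Elasticsearch update API. When the document does not exist yet, Elasticsearch indexes the content of upsert; when it exists, doc is merged into it, so insert_date is left alone.

```scala
object UpsertDocPayload {
  // Hypothetical helper: builds the two lines of one bulk "update" action.
  // Values are inserted verbatim, so this sketch assumes they need no JSON escaping.
  def bulkUpdateLines(id: String, insertDate: String, updateDate: String): Seq[String] = {
    // Action/metadata line: which document to update.
    val action = s"""{"update":{"_id":"$id"}}"""
    // "upsert" is used only when the document is missing.
    val upsert = s"""{"insert_date":"$insertDate","update_date":"$updateDate"}"""
    // "doc" is merged into the document when it already exists.
    val doc = s"""{"update_date":"$updateDate"}"""
    val body = s"""{"upsert":$upsert,"doc":$doc}"""
    Seq(action, body)
  }

  def main(args: Array[String]): Unit =
    bulkUpdateLines("1", "2017-12-06", "2017-12-06").foreach(println)
}
```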

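So I had no choice but to do it myself.

The codebase is large, but once you find the relevant part and work out how the operation flows, the change becomes much easier. The most important part is here: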

source code

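This is where the RDD is serialized to JSON and the HTTP request is assembled.

There are two approaches:

1. Pass in the complete object, then modify the JSON serialization code to assemble the request.
2. Change the structure of the object, leave the JSON serialization code untouched, and assemble the request.

Approach 1 is relatively complex: it is a lot of work, requires close familiarity with the project's code, and is expensive to implement.
Approach 2 is very simple.

Very little needs to change. You can follow the commit and make the changes yourself; a rebuild is required.

See the commit for the details. I also submitted it upstream, but the code is fairly crude and will probably not be accepted. Functionality comes first: if upstream does not take it, I can patch the package myself whenever I need it.

After that, build the package and reference the built artifact.

The application code also needs the necessary changes.

For the write path, comparing the old and new code shows what has to change; it is easy.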

case class ES_Upsert(kw_index: String, kw_type: String, id: String, date_idate: String, date_udate: String)

case class ES_Doc(date_udate: String)

case class ES_UpsertDoc(upsert: ES_Upsert, doc: ES_Doc)
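
As a rough illustration of how these classes might be populated, the sketch below wraps one record. It reads date_idate as the insert date and date_udate as the update date, which the post implies but does not state. The WrapExample object, the wrap helper, and the sample values are hypothetical.

```scala
// Case classes repeated from the post so the sketch is self-contained.
case class ES_Upsert(kw_index: String, kw_type: String, id: String, date_idate: String, date_udate: String)
case class ES_Doc(date_udate: String)
case class ES_UpsertDoc(upsert: ES_Upsert, doc: ES_Doc)

object WrapExample {
  // Hypothetical helper: "upsert" carries the full document for the insert case,
  // "doc" carries only the fields to overwrite when the document already exists.
  def wrap(index: String, tpe: String, id: String, now: String): ES_UpsertDoc =
    ES_UpsertDoc(
      upsert = ES_Upsert(index, tpe, id, date_idate = now, date_udate = now),
      doc = ES_Doc(date_udate = now)
    )

  def main(args: Array[String]): Unit =
    println(wrap("news", "article", "42", "2017-12-06 13:50:00"))
}
```

Because date_idate appears only under upsert, an update of an existing document never touches it.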

.saveToEs(Map[String, String](
"es.resource" -> "{upsert.kw_index}/{upsert.kw_type}",
"es.nodes" -> es,
"es.input.json" -> "false",
"es.nodes.discovery" -> "false",
"es.update.doc" -> "true",
"es.nodes.wan.only" -> "true",
"es.write.operation" -> "upsert",
"es.mapping.exclude" -> "upsert.kw_index,upsert.kw_type,upsert.id",
"es.mapping.id" -> "upsert.id"
))

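Everything is wrapped in an outer object, ES_UpsertDoc, whose fields are named upsert and doc; anyone familiar with ES needs no further explanation.
"es.update.doc" -> "true": the feature only takes effect when this is set to true.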

"es.resource" -> "{upsert.kw_index}/{upsert.kw_type}",
"es.mapping.exclude" -> "upsert.kw_index,upsert.kw_type,upsert.id"
The index, type, and field mapping references also need the extra level of nesting, hence the upsert. prefix.
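
Putting the pieces together, a job might look roughly like the sketch below. It assumes a locally running Spark and the patched elasticsearch-hadoop jar on the classpath: es.update.doc is the setting added by the patch and does not exist in the official package. The ES address, the application name, the object name UpsertDocJob, the conf helper, and the sample row are all placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

case class ES_Upsert(kw_index: String, kw_type: String, id: String, date_idate: String, date_udate: String)
case class ES_Doc(date_udate: String)
case class ES_UpsertDoc(upsert: ES_Upsert, doc: ES_Doc)

object UpsertDocJob {
  // Same settings as in the post, collected in one place.
  def conf(es: String): Map[String, String] = Map[String, String](
    "es.resource" -> "{upsert.kw_index}/{upsert.kw_type}",
    "es.nodes" -> es,
    "es.input.json" -> "false",
    "es.nodes.discovery" -> "false",
    "es.update.doc" -> "true", // only understood by the patched build
    "es.nodes.wan.only" -> "true",
    "es.write.operation" -> "upsert",
    "es.mapping.exclude" -> "upsert.kw_index,upsert.kw_type,upsert.id",
    "es.mapping.id" -> "upsert.id"
  )

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("upsert-doc-sketch").setMaster("local[*]"))
    val now = "2017-12-06 13:50:00" // placeholder timestamp
    val rows = sc.makeRDD(Seq(
      ES_UpsertDoc(ES_Upsert("news", "article", "42", now, now), ES_Doc(now))
    ))
    rows.saveToEs(conf("localhost:9200")) // placeholder ES address
    sc.stop()
  }
}
```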

Example project

Kafka, Spark Streaming, Elasticsearch

https://github.com/cclient/elasticsearch-spark-upsert-from-kafka

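Update: upstream has rejected the pull request. The main reason is that this package has to be guaranteed to work on all kinds of data platforms; in upstream's words: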

'whether using Map/Reduce or libraries built upon it such as Hive, Pig or Cascading or new upcoming libraries like Apache Spark'

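My case is Spark-only. Even if the change works on Spark, it will not be accepted without tests on the other platforms, and I do not have the energy to try them one by one. I will patch the package myself whenever I need it.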

posted @ 2017-12-06 13:50 cclient