Kafka Connect study notes

I. Download the connector from https://debezium.io/releases/1.4/

II. Extract debezium-connector-postgres-1.4.0.Final-plugin.tar.gz

III. Copy all the jar files under debezium-connector-postgres into ${KAFKA-HOME}/share/java/kafka (${KAFKA-HOME} stands for the Kafka installation directory)

IV. Configure the plugin directory

vi ${KAFKA-HOME}/etc/kafka/connect-distributed.properties

1. plugin.path=${KAFKA-HOME}/share/java

2. bootstrap.servers=localhost:9092

V. Create the topics that Kafka Connect requires

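The connect-configs topic must have exactly 1 partition. All three topics should be compacted, hence --config cleanup.policy=compact.

./kafka-topics --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic connect-configs --config cleanup.policy=compact

./kafka-topics --create --zookeeper localhost:2181 --replication-factor 3 --partitions 50 --topic connect-offsets --config cleanup.policy=compact

./kafka-topics --create --zookeeper localhost:2181 --replication-factor 3 --partitions 10 --topic connect-status --config cleanup.policy=compact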

VI. Start the worker

nohup ./connect-distributed /usr/local/kafka/confluent-4.1.1/etc/kafka/connect-distributed.properties > nohupc.out 2>&1 &

VII. Sample configuration file

##
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##

# This file contains some of the configurations for the Kafka Connect distributed worker. This file is intended
# to be used with the examples, and some settings may differ from those used in a production system, especially
# the `bootstrap.servers` and those specifying replication factors.

# A list of host/port pairs to use for establishing the initial connection to the Kafka cluster.
bootstrap.servers=<comma-separated list of host:port pairs>

# unique name for the cluster, used in forming the Connect cluster group. Note that this must not conflict with consumer group IDs
group.id=connect-cluster

# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=true
value.converter.schemas.enable=true

# The internal converter used for offsets, config, and status data is configurable and must be specified, but most users will
# always want to use the built-in default. Offset, config, and status data is never visible outside of Kafka Connect in this format.
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false

# Topic to use for storing offsets. This topic should have many partitions and be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
#offset.storage.partitions=25

# Topic to use for storing connector and task configurations; note that this should be a single partition, highly replicated,
# and compacted topic. Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
config.storage.topic=connect-configs
config.storage.replication.factor=1

# Topic to use for storing statuses. This topic can have multiple partitions and should be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
status.storage.topic=connect-status
status.storage.replication.factor=1
#status.storage.partitions=5

# Flush much faster than normal, which is useful for testing/debugging
offset.flush.interval.ms=10000

# These are provided to inform the user about the presence of the REST host and port configs
# Hostname & Port for the REST API to listen on. If this is set, it will bind to the interface used to listen to requests.
#rest.host.name=
#rest.port=8083

# The Hostname & Port that will be given out to other workers to connect to i.e. URLs that are routable from other servers.
#rest.advertised.host.name=
#rest.advertised.port=

# Set to a list of filesystem paths separated by commas (,) to enable class loading isolation for plugins
# (connectors, converters, transformations). The list should consist of top level directories that include
# any combination of:
# a) directories immediately containing jars with plugins and their dependencies
# b) uber-jars with plugins and their dependencies
# c) directories immediately containing the package directory structure of classes of plugins and their dependencies
# Examples:
# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
# Replace the relative path below with an absolute path if you are planning to start Kafka Connect from within a
# directory other than the home directory of Confluent Platform.
plugin.path=/usr/local/kafka/confluent-4.1.1/share/java


PS:

Do not rely on the automatically created topics. They may be created with a replication factor of 1, which can lead to data loss.
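As a rough illustration of the note above, the sketch below reads a worker properties file and reports which internal-topic replication factors fall below a chosen minimum. The helper names `parse_properties` and `under_replicated` are hypothetical, the parser handles only simple `key=value` lines and `#` comments, and the fallback of 3 is taken from the comments in the sample file.

```python
# Minimal sketch, not part of Kafka Connect: flag low replication factors
# in a connect-distributed.properties file.

STORAGE_KEYS = (
    "offset.storage.replication.factor",
    "config.storage.replication.factor",
    "status.storage.replication.factor",
)


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines; blank lines and # comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def under_replicated(props: dict, minimum: int = 3) -> list:
    """Return the storage settings whose replication factor is below `minimum`.

    A missing key is treated as 3, the default mentioned in the sample file.
    """
    return [key for key in STORAGE_KEYS if int(props.get(key, "3")) < minimum]


# Excerpt of the sample file above, used here only as test input.
SAMPLE = """
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
config.storage.topic=connect-configs
config.storage.replication.factor=1
status.storage.topic=connect-status
status.storage.replication.factor=1
"""

for key in under_replicated(parse_properties(SAMPLE)):
    print("replication factor below 3:", key)
```

To check a real file, pass the result of `open(path).read()` to `parse_properties` in the same way.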

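To create a new connector, the request body must be JSON and must contain a name field and a config field. name is the connector's name, and config is a JSON object holding the connector's configuration.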

VIII. REST API

1. Create a connector

URL: POST http://ip:8083/connectors

Request body:

{
    "name": "person-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "<database IP>",
        "database.history.kafka.bootstrap.servers": "<Kafka cluster bootstrap servers>",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "postgres",
        "database.server.name": "ninezero",
        "table.include.list": "public.person",
        "slot.name": "slot_name_pserson",
        "decimal.handling.mode": "string",
        "snapshot.mode": "exported",
        "plugin.name": "decoderbufs"
    }
}
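The registration request above could be sent from a script along these lines. This is only a sketch using the Python standard library: `CONNECT_URL`, the host name `db.example.internal` and the helper names `build_create_request` and `register` are assumptions for illustration, and the config is a subset of the request body shown above.

```python
import json
import urllib.request

# Assumption: the Connect worker's REST API listens here; adjust to your host.
CONNECT_URL = "http://localhost:8083"

# Subset of the connector config from the request body above.
person_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.internal",  # hypothetical host name
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "postgres",
    "database.server.name": "ninezero",
    "table.include.list": "public.person",
    "slot.name": "slot_name_pserson",
    "decimal.handling.mode": "string",
    "plugin.name": "decoderbufs",
}


def build_create_request(name, config, base_url=CONNECT_URL):
    """Build POST /connectors; the JSON body carries 'name' and 'config'."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/connectors",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )


def register(name, config):
    """Send the request and return the worker's JSON reply.

    urllib raises HTTPError on a non-2xx status.
    """
    request = build_create_request(name, config)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `register("person-connector", person_config)` against a running worker performs the actual registration. Building the request separately makes it easy to inspect the payload first.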

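Property	Default	Description
name		Unique name for the connector. Attempting to register again with the same name fails. This property is required by all Kafka Connect connectors.
connector.class		The name of the Java class for the connector. Always use the value io.debezium.connector.postgresql.PostgresConnector for the PostgreSQL connector.
tasks.max	1	The maximum number of tasks that should be created for this connector. The PostgreSQL connector always uses a single task and therefore does not use this value, so the default is always acceptable.
plugin.name	decoderbufs	The name of the PostgreSQL logical decoding plug-in installed on the PostgreSQL server. Supported values are decoderbufs, wal2json, wal2json_rds, wal2json_streaming, wal2json_rds_streaming and pgoutput. If you are using a wal2json plug-in and transactions are very large, the JSON batch event that contains all transaction changes might not fit into the hard-coded 1 GB memory buffer. In that case, switch to a streaming plug-in by setting plugin.name to wal2json_streaming or wal2json_rds_streaming. With a streaming plug-in, PostgreSQL sends the connector a separate message for each change in a transaction.
slot.name	debezium	The name of the PostgreSQL logical decoding slot created for streaming changes from a particular plug-in for a particular database/schema. The server uses this slot to stream events to the Debezium connector that you are configuring. Slot names must conform to the PostgreSQL replication slot naming rules, which state: "Each replication slot has a name, which can contain lower-case letters, numbers, and the underscore character."
slot.drop.on.stop	false	Whether to delete the logical replication slot when the connector stops in a graceful, expected way. By default the replication slot remains configured for the connector when the connector stops. When the connector restarts, having the same replication slot lets it resume processing where it left off. Set to true only in testing or development environments. Dropping the slot allows the database to discard WAL segments. When the connector restarts, it performs a new snapshot or continues from a persistent offset in the Kafka Connect offsets topic.
publication.name	dbz_publication	The name of the PostgreSQL publication created for streaming changes when using pgoutput. If the publication does not already exist, it is created at start-up and includes all tables. Debezium then applies its own include/exclude list filtering, if configured, to limit the publication to change events for the specific tables of interest. The connector user must have superuser permissions to create this publication, so it is usually preferable to create the publication before starting the connector for the first time. If the publication already exists, either for all tables or for a configured subset of tables, Debezium uses the publication as it is defined.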
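database.hostname		IP address or hostname of the PostgreSQL database server.
database.port	5432	Integer port number of the PostgreSQL database server.
database.user		Name of the PostgreSQL database user for connecting to the PostgreSQL database server.
database.password		Password to use when connecting to the PostgreSQL database server.
database.dbname		The name of the PostgreSQL database from which to stream changes.
database.server.name		Logical name that identifies and provides a namespace for the particular PostgreSQL database server or cluster in which Debezium captures changes. Only alphanumeric characters and underscores may be used in the logical name. The logical name should be unique across all other connectors, because it is used as the topic name prefix for all Kafka topics that receive records from this connector.
schema.include.list		An optional, comma-separated list of regular expressions that match the names of schemas whose changes you want to capture. Any schema name not included in schema.include.list is excluded from capture. By default, all non-system schemas have their changes captured. Do not also set the schema.exclude.list property.
schema.exclude.list		An optional, comma-separated list of regular expressions that match the names of schemas whose changes you do not want to capture. Any schema whose name is not included in schema.exclude.list has its changes captured, with the exception of system schemas. Do not also set the schema.include.list property.
table.include.list		An optional, comma-separated list of regular expressions that match the fully-qualified identifiers of tables whose changes you want to capture. Any table not included in table.include.list is excluded from capture. Each identifier has the form schemaName.tableName. By default, the connector captures changes in every non-system table in each schema whose changes are being captured. Do not also set the table.exclude.list property.
table.exclude.list		An optional, comma-separated list of regular expressions that match the fully-qualified identifiers of tables whose changes you do not want to capture. Any table not included in table.exclude.list has its changes captured. Each identifier has the form schemaName.tableName. Do not also set the table.include.list property.
column.include.list		An optional, comma-separated list of regular expressions that match the fully-qualified names of columns that should be included in change event record values. Fully-qualified column names have the form schemaName.tableName.columnName. Do not also set the column.exclude.list property.
column.exclude.list		An optional, comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event record values. Fully-qualified column names have the form schemaName.tableName.columnName. Do not also set the column.include.list property.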
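time.precision.mode	adaptive	Time, date, and timestamp values can be represented with different precision. adaptive captures time and timestamp values exactly as in the database, using millisecond, microsecond, or nanosecond precision depending on the database column's type. adaptive_time_microseconds captures date, datetime and timestamp values exactly as in the database, using millisecond, microsecond, or nanosecond precision depending on the database column's type. TIME type fields are an exception and are always captured as microseconds. connect always represents time and timestamp values with Kafka Connect's built-in representations for Time, Date, and Timestamp, which use millisecond precision regardless of the database column's precision. See the temporal values section of the Debezium documentation.
decimal.handling.mode	precise	Specifies how the connector should handle values of DECIMAL and NUMERIC columns. precise represents values with java.math.BigDecimal, encoded in binary form in change events. double represents values as double values, which might lose precision but is easier to use. string encodes values as formatted strings, which are easy to consume but lose the semantic information about the real type. See the decimal types section of the Debezium documentation.
hstore.handling.mode	map	Specifies how the connector should handle values of hstore columns. map represents values as a MAP. json represents values as a JSON string, that is, a formatted string such as {"key" : "val"}. See the PostgreSQL HSTORE type section of the Debezium documentation.
interval.handling.mode	numeric	Specifies how the connector should handle values of interval columns. numeric represents intervals as an approximate number of microseconds. string represents intervals exactly, using the string pattern P<years>Y<months>M<days>DT<hours>H<minutes>M<seconds>S. For example: P1Y2M3DT4H5M6.78S. See the PostgreSQL basic types section of the Debezium documentation.
database.sslmode	disable	Whether to use an encrypted connection to the PostgreSQL server. Options are: disable uses an unencrypted connection. require uses a secure (encrypted) connection and fails if one cannot be established. verify-ca behaves like require but also verifies the server TLS certificate against the configured Certificate Authority (CA) certificates, and fails if no valid matching CA certificate is found. verify-full behaves like verify-ca but also verifies that the server certificate matches the host to which the connector is trying to connect. See the PostgreSQL documentation for more information.
database.sslcert		The path to the file that contains the client's SSL certificate. See the PostgreSQL documentation for more information.
database.sslkey		The path to the file that contains the client's SSL private key. See the PostgreSQL documentation for more information.
database.sslpassword		The password for accessing the client private key in the file specified by database.sslkey. See the PostgreSQL documentation for more information.
database.sslrootcert		The path to the file that contains the root certificate(s) against which the server is validated. See the PostgreSQL documentation for more information.
database.tcpKeepAlive	true	Enables the TCP keep-alive probe to verify that the database connection is still alive. See the PostgreSQL documentation for more information.
tombstones.on.delete	true	Controls whether a tombstone event should be generated after a delete event. true: a delete operation is represented by a delete event and a subsequent tombstone event. false: only a delete event is sent. Emitting a tombstone event after a delete operation lets Kafka delete all change event records that have the same key as the deleted row.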
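column.truncate.to._length_.chars	n/a	An optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified column names have the form schemaName.tableName.columnName. In change event records, values in these columns are truncated if they are longer than the number of characters specified by length in the property name. You can specify multiple properties with different lengths in a single configuration. Length must be a positive integer, for example, column.truncate.to.20.chars.
column.mask.with._length_.chars	n/a	An optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified column names have the form schemaName.tableName.columnName. In change event values, the values in the specified columns are replaced with length asterisk (*) characters. You can specify multiple properties with different lengths in a single configuration. Length must be a positive integer or zero. When you specify zero, the connector replaces a value with an empty string.
column.mask.hash._hashAlgorithm_.with.salt._salt_	n/a	An optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified column names have the form schemaName.tableName.columnName. In change event values, the values in the specified columns are replaced with pseudonyms. A pseudonym consists of the hashed value that results from applying the specified hashAlgorithm and salt. Depending on the hash function used, referential integrity is maintained while column values are replaced with pseudonyms. Supported hash functions are described in the MessageDigest section of the Java Cryptography Architecture Standard Algorithm Name Documentation. If necessary, the pseudonym is automatically shortened to the length of the column. You can specify multiple properties with different hash algorithms and salts in a single configuration. In the following example, CzQMA0cB5K is a randomly selected salt: column.mask.hash.SHA-256.with.salt.CzQMA0cB5K = inventory.orders.customerName, inventory.shipment.customerName. Depending on the hashAlgorithm used, the salt selected, and the actual data set, the resulting data set might not be completely masked.
column.propagate.source.type	n/a	An optional, comma-separated list of regular expressions that match the fully-qualified names of columns. Fully-qualified column names have the form databaseName.tableName.columnName or databaseName.schemaName.tableName.columnName. For each specified column, the connector adds the column's original type and original length as parameters to the corresponding field schema in the emitted change records. The added schema parameters propagate the original type name and, for variable-width types, the original length: __debezium.source.column.type, __debezium.source.column.length, __debezium.source.column.scale. This property is useful for properly sizing the corresponding columns in sink databases.
datatype.propagate.source.type	n/a	An optional, comma-separated list of regular expressions that match the database-specific data type names of some columns. Fully-qualified data type names have the form databaseName.tableName.typeName or databaseName.schemaName.tableName.typeName. For these data types, the connector adds parameters to the corresponding field schema in the emitted change records. The added parameters specify the original type and length of the column: __debezium.source.column.type, __debezium.source.column.length, __debezium.source.column.scale. These parameters propagate a column's original type name and, for variable-width types, its length. This property is useful for properly sizing the corresponding columns in sink databases. See the list of PostgreSQL-specific data type names in the Debezium documentation.
message.key.columns	empty string	A semicolon-separated list of tables, each with a regular expression that matches table column names. The connector maps the values in matching columns to key fields in the change event records that it sends to Kafka topics. This is useful when a table has no primary key, or when you want to order the change event records in a Kafka topic by a field that is not a primary key. Separate entries with semicolons, and put a colon between the fully-qualified table name and its regular expression. The format is: schema-name.table-name:regexp;... For example: schemaA.table_a:regex_1;schemaB.table_b:regex_2;schemaC.table_c:regex_3. If table_a has an id column and regex_1 is ^i (which matches any column that starts with i), the connector maps the value in table_a's id column to a key field in the change events that it sends to Kafka.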
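publication.autocreate.mode	all_tables	Applies only when streaming changes with the pgoutput plug-in. The setting determines how the publication is created. Possible settings are: all_tables: if a publication exists, the connector uses it. If no publication exists, the connector creates a publication for all tables in the database for which it is capturing changes. This requires that the database user that has permission to perform replication also has permission to create a publication. The publication is created with CREATE PUBLICATION <publication_name> FOR ALL TABLES;. disabled: the connector does not attempt to create a publication. A database administrator or the user configured to perform replication must have created the publication before the connector runs. If the connector cannot find the publication, it throws an exception and stops. filtered: if a publication exists, the connector uses it. If no publication exists, the connector creates a new publication for the tables that match the current filter configuration, as specified by the database.exclude.list, schema.include.list, schema.exclude.list, and table.include.list connector configuration properties. For example: CREATE PUBLICATION <publication_name> FOR TABLE <tbl1, tbl2, tbl3>.
binary.handling.mode	bytes	Specifies how binary (bytea) columns should be represented in change events. bytes represents binary data as a byte array. base64 represents binary data as a base64-encoded string. hex represents binary data as a hex-encoded (base16) string.
truncate.handling.mode	skip	Specifies whether TRUNCATE events should be propagated (available only when using the pgoutput plug-in with Postgres 11 or later). skip causes these events to be omitted (the default). include causes these events to be included. See the truncate events section of the Debezium documentation for the structure of truncate events.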
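Several rules in the table above are easy to get wrong by hand: the slot name character set, the server name character set, and the include/exclude pairs that must not be set together. The sketch below checks them before a config is submitted. `check_config` is a hypothetical helper, not part of Debezium, and it covers only these three rules.

```python
import re

# Character sets taken from the slot.name and database.server.name rows above.
SLOT_NAME = re.compile(r"[a-z0-9_]+")
SERVER_NAME = re.compile(r"[A-Za-z0-9_]+")

# Property pairs that the table says must not both be set.
EXCLUSIVE_PAIRS = (
    ("schema.include.list", "schema.exclude.list"),
    ("table.include.list", "table.exclude.list"),
    ("column.include.list", "column.exclude.list"),
)


def check_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    slot = config.get("slot.name", "debezium")  # default from the table
    if not SLOT_NAME.fullmatch(slot):
        problems.append(
            "slot.name %r may only contain lower-case letters, digits and underscores" % slot
        )

    server = config.get("database.server.name", "")
    if not SERVER_NAME.fullmatch(server):
        problems.append(
            "database.server.name %r may only contain alphanumerics and underscores" % server
        )

    for include_key, exclude_key in EXCLUSIVE_PAIRS:
        if include_key in config and exclude_key in config:
            problems.append("set only one of %s and %s" % (include_key, exclude_key))

    return problems


for problem in check_config({"slot.name": "Slot-1", "database.server.name": "ninezero"}):
    print(problem)
```

The connector performs its own validation when it is created, so a check like this only catches mistakes earlier.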

For full details, see: https://debezium.io/documentation/reference/1.4/connectors/postgresql.html

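Other API endpoints:

GET /connectors – returns the names of all active connectors.

GET /connectors/{name} – returns information about the specified connector.

GET /connectors/{name}/config – returns the configuration of the specified connector.

PUT /connectors/{name}/config – updates the configuration of the specified connector.

GET /connectors/{name}/status – returns the status of the specified connector: whether it is running, paused, or failed. If an error occurred, the error details are listed as well.

GET /connectors/{name}/tasks – returns the tasks currently running for the specified connector.

GET /connectors/{name}/tasks/{taskid}/status – returns the status of a task of the specified connector.

PUT /connectors/{name}/pause – pauses the connector and its tasks, which stops data processing until the connector is resumed.

PUT /connectors/{name}/resume – resumes a paused connector.

POST /connectors/{name}/restart – restarts a connector, most often used after the connector has failed.

POST /connectors/{name}/tasks/{taskId}/restart – restarts a task, usually because it has failed.

DELETE /connectors/{name} – deletes a connector, stopping all of its tasks and removing its configuration.

For details, see: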

https://docs.confluent.io/platform/current/connect/references/restapi.html
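The endpoints listed above can be wrapped in a small helper, sketched here with the Python standard library. `CONNECT_URL`, `ENDPOINTS`, `build_request` and `call` are names chosen for illustration. The connector restart path used is POST /connectors/{name}/restart from the Kafka Connect REST API.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: the Connect worker's REST API listens here; adjust to your host.
CONNECT_URL = "http://localhost:8083"

# HTTP method and path template for a few of the endpoints listed above.
ENDPOINTS = {
    "list": ("GET", "/connectors"),
    "status": ("GET", "/connectors/{name}/status"),
    "pause": ("PUT", "/connectors/{name}/pause"),
    "resume": ("PUT", "/connectors/{name}/resume"),
    "restart": ("POST", "/connectors/{name}/restart"),
    "delete": ("DELETE", "/connectors/{name}"),
}


def build_request(action, name="", base_url=CONNECT_URL):
    """Build the request for one action; the connector name is URL-encoded."""
    method, template = ENDPOINTS[action]
    path = template.format(name=quote(name, safe=""))
    return urllib.request.Request(
        base_url + path, method=method, headers={"Accept": "application/json"}
    )


def call(action, name=""):
    """Send the request; return parsed JSON, or None when the body is empty."""
    with urllib.request.urlopen(build_request(action, name), timeout=10) as response:
        raw = response.read()
        return json.loads(raw) if raw else None
```

For example, `call("status", "person-connector")` returns the status document, while `call("pause", "person-connector")` returns None because the worker replies with an empty body.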

Querying and dropping PostgreSQL logical replication slots

select * from pg_replication_slots;

SELECT * FROM pg_drop_replication_slot('slot_name_yxswxt');

Posted 2021-05-24 11:13 by TimeSay
