入门ClickHouse:从安装到使用
ClickHouse是一款高性能的列式数据库管理系统,特别适用于在线分析处理(OLAP: Online Analytical Processing,在线分析处理)任务。本文将带你从安装到使用ClickHouse,详细介绍其基础操作和一些高级功能。
安装与配置
首先,从ClickHouse官网下载安装包并进行安装。安装完成后,配置ClickHouse服务。
配置文件
ClickHouse的配置文件通常位于 /etc/clickhouse-server/config.xml。你可以在这个文件中配置各种参数,例如网络设置、存储路径等。
参考官方文档:ClickHouse Configuration Files
配置文件有2部分组成
config.xml
<default_profile>default</default_profile>
<users_config>users.xml</users_config>
<grpc_port>9201</grpc_port>
<enable_ssl>false</enable_ssl>
<ssl_cert_file>/path/to/ssl_cert_file</ssl_cert_file>
<ssl_key_file>/path/to/ssl_key_file</ssl_key_file>
<ssl_require_client_auth>false</ssl_require_client_auth>
<ssl_ca_cert_file>/path/to/ssl_ca_cert_file</ssl_ca_cert_file>
<transport_compression_type>none</transport_compression_type>
<transport_compression_level>0</transport_compression_level>
<max_send_message_size>-1</max_send_message_size>
<max_receive_message_size>-1</max_receive_message_size>
<verbose_logs>false</verbose_logs>
<http_options_response>
x-clickhouse-key, Authorization
</http_options_response>
<http_port>8123</http_port>
<tcp_port>9101</tcp_port>
<interserver_http_port>9009</interserver_http_port>
<max_connections>4096</max_connections>
<keep_alive_timeout>10</keep_alive_timeout>
<concurrent_threads_soft_limit_num>0</concurrent_threads_soft_limit_num>
<concurrent_threads_soft_limit_ratio_to_cores>22</concurrent_threads_soft_limit_ratio_to_cores>
<max_concurrent_queries>1000</max_concurrent_queries>
<listen_host>0.0.0.0</listen_host>
<!-- Maximum memory usage (resident set size) for server process.
Zero value or unset means default. Default is "max_server_memory_usage_to_ram_ratio" of available
physical RAM.
If the value is larger than "max_server_memory_usage_to_ram_ratio" of available physical RAM, it
will be cut down.
The constraint is checked on query execution time.
If a query tries to allocate memory and the current memory usage plus allocation is greater
than specified threshold, exception will be thrown.
It is not practical to set this constraint to small values like just a few gigabytes,
because memory allocator will keep this amount of memory in caches and the server will deny service
of queries.
-->
<max_server_memory_usage>0</max_server_memory_usage>
<max_thread_pool_size>10000</max_thread_pool_size>
<async_load_databases>true</async_load_databases>
<max_server_memory_usage_to_ram_ratio>0.9</max_server_memory_usage_to_ram_ratio>
<total_memory_profiler_step>4194304</total_memory_profiler_step>
<total_memory_tracker_sample_probability>0</total_memory_tracker_sample_probability>
<!-- Size of cache of uncompressed blocks of data, used in tables of MergeTree family.
In bytes. Cache is single for server. Memory is allocated only on demand.
Cache is used when 'use_uncompressed_cache' user setting turned on (off by default).
Uncompressed cache is advantageous only for very short queries and in rare cases.
Note: uncompressed cache can be pointless for lz4, because memory bandwidth
is slower than multi-core decompression on some server configurations.
Enabling it can sometimes paradoxically make queries slower.
-->
<uncompressed_cache_size>8589934592</uncompressed_cache_size>
<custom_cached_disks_base_directory>/var/lib/clickhouse/caches/</custom_cached_disks_base_directory>
<validate_tcp_client_information>false</validate_tcp_client_information>
<tmp_path>/data3/tmp/</tmp_path>
<allow_plaintext_password>1</allow_plaintext_password>
<allow_no_password>1</allow_no_password>
<allow_implicit_no_password>1</allow_implicit_no_password>
<custom_settings_prefixes>SQL_</custom_settings_prefixes>
<default_database>default</default_database>
<!-- Server time zone could be set here.
Time zone is used when converting between String and DateTime types,
when printing DateTime in text formats and parsing DateTime from text,
it is used in date and time related functions, if specific time zone was not passed as an argument.
Time zone is specified as identifier from IANA time zone database, like UTC or Africa/Abidjan.
If not specified, system time zone at server startup is used.
Please note, that server could display time zone alias instead of specified name.
Example: Zulu is an alias for UTC.
-->
<mlock_executable>true</mlock_executable>
<remap_executable>false</remap_executable>
<builtin_dictionaries_reload_interval>3600</builtin_dictionaries_reload_interval>
<max_session_timeout>3600</max_session_timeout>
<default_session_timeout>60</default_session_timeout>
<query_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<!--
Table TTL specification:
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree-table-ttl
Example:
event_date + INTERVAL 1 WEEK
event_date + INTERVAL 7 DAY DELETE
event_date + INTERVAL 2 WEEK TO DISK 'bbb'
-->
<!--
ORDER BY expr:
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#order_by
Example:
event_date, event_time
event_date, type, query_id
event_date, event_time, initial_query_id
<order_by>event_date, event_time, initial_query_id</order_by>
-->
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</query_log>
<trace_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</trace_log>
<query_thread_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</query_thread_log>
<query_views_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_views_log>
<part_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</part_log>
<metric_log>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<collect_interval_milliseconds>1000</collect_interval_milliseconds>
<flush_on_crash>false</flush_on_crash>
</metric_log>
<error_log>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<collect_interval_milliseconds>1000</collect_interval_milliseconds>
<flush_on_crash>false</flush_on_crash>
</error_log>
<asynchronous_metric_log>
<flush_interval_milliseconds>7000</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</asynchronous_metric_log>
<opentelemetry_span_log>
engine MergeTree
partition by toYYYYMM(finish_date)
order by (finish_date, finish_time_us, trace_id)
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</opentelemetry_span_log>
<crash_log>
<partition_by />
<flush_interval_milliseconds>1000</flush_interval_milliseconds>
<max_size_rows>1024</max_size_rows>
<reserved_size_rows>1024</reserved_size_rows>
<buffer_size_rows_flush_threshold>512</buffer_size_rows_flush_threshold>
<flush_on_crash>true</flush_on_crash>
</crash_log>
<!-- Session log. Stores user log in (successful or not) and log out events.
Note: session log has known security issues and should not be used in production.
-->
<!-- <session_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</session_log> -->
<processors_profile_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</processors_profile_log>
<asynchronous_insert_log>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
<partition_by>event_date</partition_by>
</asynchronous_insert_log>
<backup_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</backup_log>
<s3queue_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</s3queue_log>
<blob_storage_log>
<partition_by>toYYYYMM(event_date)</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</blob_storage_log>
<!-- Custom TLD lists.
Format:
Changes will not be applied w/o server restart.
Path to the list is under top_level_domains_path (see above).
-->
<top_level_domains_lists>
</top_level_domains_lists>
<dictionaries_config>_dictionary.ml</dictionaries_config>
<dictionaries_lazy_load>true</dictionaries_lazy_load>
<wait_dictionaries_load_at_startup>true</wait_dictionaries_load_at_startup>
<user_defined_executable_functions_config>_function.ml</user_defined_executable_functions_config>
<!--
<!- - Set of variants. Checked in order. Last matching case wins. If nothing matches, lz4 will be
used. - ->
<!- - Conditions. All must be satisfied. Some conditions may be omitted. - ->
<min_part_size>10000000000</min_part_size> <!- - Min part size in bytes. - ->
<min_part_size_ratio>0.01</min_part_size_ratio> <!- - Min size of part relative to whole table
size. - ->
<!- - What compression method to use. - ->
-->
<encryption_codecs>
</encryption_codecs>
users.xml
<max_memory_usage>10000000000</max_memory_usage>
<use_uncompressed_cache>0</use_uncompressed_cache>
<load_balancing>random</load_balancing>
<max_memory_usage>10000000000</max_memory_usage>
<use_uncompressed_cache>0</use_uncompressed_cache>
<load_balancing>random</load_balancing>
<result_rows>0</result_rows>
<read_rows>0</read_rows>
<execution_time>0</execution_time>
<password_sha256_hex>948fe603f61dc036b5c596dc09fe3ce3f3d30dc90f024c85f3c82db2ccab679d</password_sha256_hex>
<password_sha256_hex>1917e33407c28366c8e3b975b17e7374589312676b90229adb4ce6e58552e223</password_sha256_hex>
启动服务
使用以下命令启动ClickHouse服务器:
/data3/tmp/clickhouse server -C /data3/tmp/config.xml
启动客户端:
/data3/tmp/clickhouse client -C /data3/tmp/config.xml
创建数据库和表
首先,创建一个数据库:
CREATE DATABASE IF NOT EXISTS helloworld;
然后,创建一个表:
CREATE TABLE my_first_table
(
user_id UInt32,
message String,
timestamp DateTime,
metric Float32
)
ENGINE = MergeTree
PRIMARY KEY (user_id, timestamp);
插入数据
往表中插入数据:
INSERT INTO my_first_table (user_id, message, timestamp, metric) VALUES
(101, 'Hello, ClickHouse!', now(), -1.0),
(102, 'Insert a lot of rows per batch', yesterday(), 1.41421),
(102, 'Sort your data based on your commonly-used queries', today(), 2.718),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159);
查询数据
从表中查询数据:
SELECT * FROM my_first_table ORDER BY timestamp;
高级查询示例
使用S3数据源进行查询:
SELECT
passenger_count,
avg(toFloat32(total_amount))
FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_0.gz',
'TabSeparatedWithNames'
)
GROUP BY passenger_count
ORDER BY passenger_count;
创建更多表和操作
创建另一个表:
CREATE TABLE my_two_table (
name String,
ping UInt32,
created_at DateTime DEFAULT now()
) ENGINE = MergeTree()
ORDER BY created_at;
插入数据:
INSERT INTO my_two_table (name, ping) VALUES ('Alice', 12);
INSERT INTO my_two_table (name, ping) VALUES ('Bob', 13);
进行不同时间范围内的数据查询:
SELECT * FROM my_two_table
WHERE created_at >= toDateTime('2020-01-01 00:00:00') AND created_at <= toDateTime('2020-01-01 23:59:59');
SELECT * FROM my_two_table
WHERE created_at > toDateTime('2022-03-03 09:46:33') AND created_at < toDateTime('2022-03-05 09:46:33');
数据分析
进行数据聚合查询:
SELECT
toStartOfMinute(created_at) AS minute,
count(*) AS count,
avg(ping) as ping
FROM my_two_table
WHERE name='Alice'
GROUP BY minute
ORDER BY minute;
删除数据:
DELETE FROM my_two_table WHERE created_at = toDateTime('2020-01-01 00:00:00');
密码管理
生成随机密码并进行SHA-256哈希:
PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'
生成客户端字符串的SHA-256哈希:
echo -n "client" | sha256sum | cut -d ' ' -f 1
总结
ClickHouse不仅支持高性能的数据存储与检索,还提供了丰富的分析功能,非常适合用于处理大规模数据的实时分析需求。
如果ClickHouse有更深入的兴趣,可以访问其快速入门文档以获取更多信息和示例。
测试脚本
from clickhouse_driver import Client
import random
from datetime import datetime, timedelta
import time
# 连接 ClickHouse
client = Client(
host='localhost',
port=9101,
user='default', # 替换为你的用户名
password='client', # 替换为你的密码
database='helloworld' # 指
)
# 配置参数 1 1000 0002
num_records = 5_0000_0000
batch_size = 500000 # 每次批量插入的记录数
start_date = datetime(2020, 1, 1)
name_pool = ['Alice', 'Bob', 'Carol', 'Dave', 'Eve']
# 数据生成器
def generate_data(num_records, start_date, name_pool):
for i in range(num_records):
name = random.choice(name_pool)
ping = random.randint(0, 1000)
created_at = start_date + timedelta(seconds=i)
yield (name, ping, created_at)
# 批量插入数据
def batch_insert_data(client, data_generator, batch_size):
batch = []
for record in data_generator:
batch.append(record)
if len(batch) == batch_size:
client.execute('INSERT INTO my_two_table (name, ping, created_at) VALUES', batch)
batch.clear()
if batch:
client.execute('INSERT INTO my_two_table (name, ping, created_at) VALUES', batch)
# 主函数
def main():
data_generator = generate_data(num_records, start_date, name_pool)
batch_insert_data(client, data_generator, batch_size)
print("数据插入完成")
if __name__ == '__main__':
x = time.time()
main()
print('插入{}千万数据耗时 {}秒'.format(num_records, time.time() - x))
# clickhouse
# 插入1 千万数据耗时 69.04075813293457秒
# 插入1 亿数据耗时 655.3600518703461秒
# 插入5 亿数据耗时 3320.025939464569秒 => 55.33分钟
# 磁盘文件大小 5.2G
浙公网安备 33010602011771号