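20. ClickHouse URL functions

None of these functions follow the RFC. They are simplified as much as possible to improve performance.
--- What is an RFC?
---- Request for Comments (RFC) is a numbered series of documents that collects information about the Internet, along with software documentation from the UNIX and Internet communities.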

I. Functions that extract parts of a URL

If the relevant part is not present in the URL, an empty string is returned.

--1.protocol

--Extracts the protocol from a URL.

Examples of typical return values: http, https, ftp, mailto, tel, magnet...

Example

SELECT protocol('svn+ssh://some.svn-hosting.com:80/repo/trunk')

Query id: bec1936f-eb94-4223-aef2-d4e7af1e5ea4

┌─protocol('svn+ssh://some.svn-hosting.com:80/repo/trunk')─┐
│ svn+ssh                                                  │
└──────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.

--2.domain

--Extracts the hostname from a URL.

domain(url)

Arguments

url: the URL. Type: String.

The URL can be specified with or without a scheme. Examples:

svn+ssh://some.svn-hosting.com:80/repo/trunk
some.svn-hosting.com:80/repo/trunk
https://yandex.com/time/

For these examples, the domain function returns the following results:

some.svn-hosting.com
some.svn-hosting.com
yandex.com

Returned values

The host name, if ClickHouse can parse the input string as a URL.
An empty string, if ClickHouse cannot parse the input string as a URL.

Type: String.

Example

SELECT domain('svn+ssh://some.svn-hosting.com:80/repo/trunk')

Query id: 6e1ae2c1-7b44-4634-a4a8-cc44340f4b24

┌─domain('svn+ssh://some.svn-hosting.com:80/repo/trunk')─┐
│ some.svn-hosting.com                                   │
└────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.

--3.domainWithoutWWW

--Returns the domain with at most one leading "www." removed, if present.

Example

SELECT domainWithoutWWW('svn+ssh://some.svn-hosting.com:80/repo/trunk')

Query id: 741ab740-fc0f-426c-af1d-ca2f0e35bb00

┌─domainWithoutWWW('svn+ssh://some.svn-hosting.com:80/repo/trunk')─┐
│ some.svn-hosting.com                                             │
└──────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.
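
The URL in the example above has no "www." prefix, so the stripping is not visible there. A minimal sketch that places domain and domainWithoutWWW side by side on a host that does carry the prefix (www.example.com is a placeholder chosen for illustration, and the column aliases are arbitrary):

```sql
-- Placeholder URL; only the second column is expected to drop the leading "www."
SELECT
    domain('https://www.example.com/path') AS with_www,
    domainWithoutWWW('https://www.example.com/path') AS without_www;
```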

--4.topLevelDomain

--Extracts the top-level domain from a URL.

topLevelDomain(url)

Arguments

url: the URL. Type: String.

The URL can be specified with or without a scheme. Examples:

svn+ssh://some.svn-hosting.com:80/repo/trunk
some.svn-hosting.com:80/repo/trunk
https://yandex.com/time/

Returned values

The top-level domain, if ClickHouse can parse the input string as a URL.
An empty string, if ClickHouse cannot parse the input string as a URL.

Type: String.

Example

SELECT topLevelDomain('svn+ssh://some.svn-hosting.com:80/repo/trunk')

Query id: 8c5354f4-a378-4726-b4ca-83ce313a4ade

┌─topLevelDomain('svn+ssh://some.svn-hosting.com:80/repo/trunk')─┐
│ com                                                            │
└────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.

--5.firstSignificantSubdomain

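--Returns the "first significant subdomain". This is a non-standard concept specific to Yandex.Metrica. If the second-level domain is "com", "net", "org", or "co", the first significant subdomain is the third-level domain; otherwise it is the second-level domain.

Example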

SELECT firstSignificantSubdomain('https://news.yandex.com.tr/')

Query id: 1e12f972-9589-464d-901e-2d46713a49d8

┌─firstSignificantSubdomain('https://news.yandex.com.tr/')─┐
│ yandex                                                   │
└──────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.

The list of "insignificant" second-level domains and other implementation details may change in the future.
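
To see both branches of the rule at once, a sketch along these lines could be run. The second URL is added here for illustration only: it has an ordinary second-level domain, while the first sits under "com.tr". Both calls should yield yandex, picked from a different level in each case.

```sql
-- First URL: the second-level domain is "com", so the third level is taken.
-- Second URL (illustrative): the second level itself is taken.
SELECT
    firstSignificantSubdomain('https://news.yandex.com.tr/') AS under_com_tr,
    firstSignificantSubdomain('https://news.yandex.ru/') AS under_ru;
```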

--6.cutToFirstSignificantSubdomain

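--Returns the part of the domain that includes the top-level subdomains up to the "first significant subdomain" (see the explanation above).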

SELECT cutToFirstSignificantSubdomain('https://news.yandex.com.tr/')

Query id: a50baf0a-09c9-4efc-b3fd-c2884f10b7c2

┌─cutToFirstSignificantSubdomain('https://news.yandex.com.tr/')─┐
│ yandex.com.tr                                                 │
└───────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.

--7.cutToFirstSignificantSubdomainWithWWW

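--Returns the part of the domain that includes the top-level subdomains up to the "first significant subdomain", without stripping "www".

Example: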

SELECT cutToFirstSignificantSubdomainWithWWW('https://news.yandex.com.tr/')

Query id: e8bd3231-4b5d-4414-92de-c78bd7efbcb5

┌─cutToFirstSignificantSubdomainWithWWW('https://news.yandex.com.tr/')─┐
│ yandex.com.tr                                                        │
└──────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.
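
The host in the example above contains no "www", so the two variants cannot be told apart from it. A sketch that runs both on a short input where "www" sits directly in front of the top-level domain (the input 'www.tr' and the aliases are chosen for illustration); comparing the two columns shows whether the prefix is kept:

```sql
-- Same input for both variants; only the WithWWW one should keep the "www" label.
SELECT
    cutToFirstSignificantSubdomain('www.tr') AS default_variant,
    cutToFirstSignificantSubdomainWithWWW('www.tr') AS with_www_variant;
```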

--8.cutToFirstSignificantSubdomainCustom

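--Returns the part of the domain that includes the top-level subdomains up to the first significant subdomain. Accepts a custom TLD list name.

Can be useful if you need a fresh TLD list or you have a custom one.

Configuration example: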

<!-- <top_level_domains_path>/var/lib/clickhouse/top_level_domains/</top_level_domains_path> -->
<top_level_domains_lists>
    <!-- https://publicsuffix.org/list/public_suffix_list.dat -->
    <public_suffix_list>public_suffix_list.dat</public_suffix_list>
    <!-- NOTE: path is under top_level_domains_path -->
</top_level_domains_lists>

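Syntax

cutToFirstSignificantSubdomainCustom(URL, TLD)

Parameters

URL: the URL. String.
TLD: the custom TLD list name. String.

Returned value

The part of the domain that includes the top-level subdomains up to the first significant subdomain.

Type: String.

Example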

SELECT cutToFirstSignificantSubdomainCustom('bar.foo.there-is-no-such-domain', 'public_suffix_list');

┌─cutToFirstSignificantSubdomainCustom('bar.foo.there-is-no-such-domain', 'public_suffix_list')─┐
│ foo.there-is-no-such-domain                                                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
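
A side-by-side sketch, assuming the public_suffix_list configuration shown above is loaded on the server. The first column uses the built-in rules and the second the custom list, so any difference between them comes from the list alone:

```sql
-- Requires the 'public_suffix_list' entry from the configuration example;
-- without it the second call is expected to fail.
SELECT
    cutToFirstSignificantSubdomain('bar.foo.there-is-no-such-domain') AS built_in_rules,
    cutToFirstSignificantSubdomainCustom('bar.foo.there-is-no-such-domain', 'public_suffix_list') AS custom_list;
```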

See also

firstSignificantSubdomain.

--9.cutToFirstSignificantSubdomainCustomWithWWW

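--Returns the part of the domain that includes the top-level subdomains up to the first significant subdomain, without stripping "www". Accepts a custom TLD list name.

Can be useful if you need a fresh TLD list or you have a custom one.

Configuration example: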

<!-- <top_level_domains_path>/var/lib/clickhouse/top_level_domains/</top_level_domains_path> -->
<top_level_domains_lists>
    <!-- https://publicsuffix.org/list/public_suffix_list.dat -->
    <public_suffix_list>public_suffix_list.dat</public_suffix_list>
    <!-- NOTE: path is under top_level_domains_path -->
</top_level_domains_lists>

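Syntax

cutToFirstSignificantSubdomainCustomWithWWW(URL, TLD)

Parameters

URL: the URL. String.
TLD: the custom TLD list name. String.

Returned value

The part of the domain that includes the top-level subdomains up to the first significant subdomain, without stripping "www".

Type: String.

Example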

SELECT cutToFirstSignificantSubdomainCustomWithWWW('www.foo', 'public_suffix_list');

┌─cutToFirstSignificantSubdomainCustomWithWWW('www.foo', 'public_suffix_list')─┐
│ www.foo                                                                      │
└──────────────────────────────────────────────────────────────────────────────┘

See also

firstSignificantSubdomain.

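--10.firstSignificantSubdomainCustom

--Returns the first significant subdomain. Accepts a custom TLD list name.

Can be useful if you need a fresh TLD list or you have a custom one.

Configuration example: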

<!-- <top_level_domains_path>/var/lib/clickhouse/top_level_domains/</top_level_domains_path> -->
<top_level_domains_lists>
    <!-- https://publicsuffix.org/list/public_suffix_list.dat -->
    <public_suffix_list>public_suffix_list.dat</public_suffix_list>
    <!-- NOTE: path is under top_level_domains_path -->
</top_level_domains_lists>

Syntax

firstSignificantSubdomainCustom(URL, TLD)

Parameters

URL: the URL. String.
TLD: the custom TLD list name. String.

Returned value

The first significant subdomain.

Type: String.

Example

Query:

SELECT firstSignificantSubdomainCustom('bar.foo.there-is-no-such-domain', 'public_suffix_list');

┌─firstSignificantSubdomainCustom('bar.foo.there-is-no-such-domain', 'public_suffix_list')─┐
│ foo                                                                                      │
└──────────────────────────────────────────────────────────────────────────────────────────┘

See also

firstSignificantSubdomain.

--11.port(URL[, default_port = 0])

--Returns the port, or default_port if the URL contains no port (or in case of a validation error).
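
A short sketch of the three cases: a URL with a port, a URL without one using the built-in default of 0, and a URL without one using a caller-supplied default. The host example.com and the port numbers are placeholders for illustration.

```sql
-- The second argument is only consulted when the URL carries no port.
SELECT
    port('http://example.com:8443/') AS explicit_port,
    port('http://example.com/') AS fallback_to_zero,
    port('http://example.com/', 80) AS fallback_to_given_default;
```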

--12.path

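--Returns the path. The path does not include the query string.

--13.pathFull

--Same as above, but including the query string and fragment.

--14.queryString

--Returns the query string. It does not include the leading question mark, nor # and everything after #.

--15.fragment

--Returns the fragment identifier. The fragment does not include the leading hash symbol.

--16.queryStringAndFragment

--Returns the query string and the fragment identifier.

--17.extractURLParameter(URL, name)

--Returns the value of the 'name' parameter in the URL, if present; otherwise an empty string. If there are several parameters with this name, the first occurrence is returned. The function assumes that the parameter name is encoded in the URL exactly the same way as in the passed argument.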
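
The encoding assumption can be checked with a sketch like the following. The host example.com and the parameter name are made up for illustration. Going by the description, the lookup by the encoded name should find the first of the two values, while the lookup by the decoded name should come back empty.

```sql
-- The parameter name appears percent-encoded in the URL ("first%20name").
SELECT
    extractURLParameter('http://example.com/?first%20name=Ann&first%20name=Bob', 'first%20name') AS lookup_by_encoded_name,
    extractURLParameter('http://example.com/?first%20name=Ann&first%20name=Bob', 'first name') AS lookup_by_decoded_name;
```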

--18.extractURLParameters(URL)

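--Returns an array of name=value strings corresponding to the URL parameters. The values are not decoded in any way.

--19.extractURLParameterNames(URL)

--Returns an array of name strings corresponding to the names of the URL parameters. The values are not decoded in any way.

--20.URLHierarchy(URL)

--Returns an array containing the URL truncated at the end by the symbols / and ? in the path and query string. Consecutive separator characters are counted as one. The cut is made at the position after all the consecutive separator characters.

Example (functions 11 to 19)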

SELECT
    port('http://paul@www.example.com:80/'),
    path('https://blog.csdn.net/u012111465/article/details/85250030'),
    pathFull('https://clickhouse.yandex/#quick-start'),
    queryString('http://paul@www.example.com:80/page=1&lr=213'),
    fragment('https://clickhouse.yandex/#quick-start'),
    queryStringAndFragment('https://www.baidu.com/s?ie=utf-8&rsv_sug7=100#ei-ai'),
    extractURLParameter('https://www.baidu.com/s?ie=utf-8&rsv_sug7=100#ei-ai', 'ie'),
    extractURLParameters('https://www.baidu.com/s?ie=utf-8&rsv_sug7=100#ei-ai'),
    extractURLParameterNames('https://www.baidu.com/s?ie=utf-8&rsv_sug7=100#ei-ai')

Query id: 554171c6-86fc-4ab5-9b5d-1bb44b6f24a4

┌─port('http://paul@www.example.com:80/')─┬─path('https://blog.csdn.net/u012111465/article/details/85250030')─┬─pathFull('https://clickhouse.yandex/#quick-start')─┬─queryString('http://paul@www.example.com:80/page=1&lr=213')─┬─fragment('https://clickhouse.yandex/#quick-start')─┬─queryStringAndFragment('https://www.baidu.com/s?ie=utf-8&rsv_sug7=100#ei-ai')─┬─extractURLParameter('https://www.baidu.com/s?ie=utf-8&rsv_sug7=100#ei-ai', 'ie')─┬─extractURLParameters('https://www.baidu.com/s?ie=utf-8&rsv_sug7=100#ei-ai')─┬─extractURLParameterNames('https://www.baidu.com/s?ie=utf-8&rsv_sug7=100#ei-ai')─┐
│                                      80 │ /u012111465/article/details/85250030                              │ /#quick-start                                      │                                                             │ quick-start                                        │ ie=utf-8&rsv_sug7=100#ei-ai                                                   │ utf-8                                                                            │ ['ie=utf-8','rsv_sug7=100']                                                 │ ['ie','rsv_sug7']                                                               │
└─────────────────────────────────────────┴───────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.003 sec.
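
The combined query above does not call URLHierarchy. A minimal sketch that does, using arrayJoin to print one truncated prefix per row so the cut positions are easy to read (the URL is an illustrative placeholder):

```sql
-- arrayJoin unfolds the returned array into separate rows, one per prefix.
SELECT arrayJoin(URLHierarchy('https://example.com/browse/CONV-6788?page=2')) AS url_prefix;
```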

--21.URLPathHierarchy(URL)

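--The same as above, but without the protocol and host in the result. The / element (the root) is not included. Example: the function is used to implement tree reports of URLs in Yandex.Metrica.

Example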

SELECT URLPathHierarchy('https://example.com/browse/CONV-6788')

Query id: f9f424a8-d9e6-4617-8dac-248ccc1b0d2e

┌─URLPathHierarchy('https://example.com/browse/CONV-6788')─┐
│ ['/browse/','/browse/CONV-6788']                         │
└──────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.004 sec.

--22.decodeURLComponent(URL)

--Returns the decoded URL. Example:

SELECT decodeURLComponent('http://127.0.0.1:8123/?query=SELECT%201%3B') AS DecodedURL

Query id: d6b4cefd-3315-4362-b7b6-fe2b48b94053

┌─DecodedURL─────────────────────────────┐
│ http://127.0.0.1:8123/?query=SELECT 1; │
└────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.

--23.decodeURLFormComponent(URL)

--Returns the decoded URL. Follows RFC 1866: a plain plus (+) is decoded as a space. Example:

SELECT decodeURLFormComponent('http://127.0.0.1:8123/?query=SELECT%201+2%2B3') AS DecodedURL

Query id: e42a9720-73ec-4f95-ac19-48aef5e83e90

┌─DecodedURL────────────────────────────────┐
│ http://127.0.0.1:8123/?query=SELECT 1 2+3 │
└───────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.
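
The difference between the two decoders is easiest to see when both receive the same input. A sketch, reusing the encoded fragment from the example above as a bare string; per the descriptions, only the form variant should turn the plain "+" into a space:

```sql
-- "%20" and "%2B" are percent-escapes; the bare "+" is where the two functions should differ.
SELECT
    decodeURLComponent('SELECT%201+2%2B3') AS component_decoding,
    decodeURLFormComponent('SELECT%201+2%2B3') AS form_decoding;
```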

--24.netloc

--Extracts the network location (username:password@host:port) from a URL.

Syntax

netloc(URL)

Arguments

url: the URL. String.

Returned value

username:password@host:port.

Type: String.

Example

SELECT netloc('http://paul@www.example.com:80/')

Query id: 38bd3547-1461-463c-ab27-63d993ebbba3

┌─netloc('http://paul@www.example.com:80/')─┐
│ paul@www.example.com:80                   │
└───────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec.

II. Functions that remove parts of a URL

If the URL does not contain the corresponding part, it is left unchanged.

--1.cutWWW

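Removes at most one leading "www." from the URL's domain, if present.

--2.cutQueryString

Removes the query string. The question mark is also removed.

--3.cutFragment

Removes the fragment identifier. The number sign is also removed.

--4.cutQueryStringAndFragment

Removes the query string and the fragment identifier. The question mark and the number sign are also removed.

--5.cutURLParameter(URL, name)

Removes the 'name' URL parameter, if present. The function assumes that the parameter name is encoded in the URL exactly the same way as in the passed argument.

Example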

SELECT
    cutWWW('https://www.baidu.com'),
    cutQueryString('http://www.baidu.com/1?page=1'),
    cutFragment('http://www.baidu.com/#quick-demo'),
    cutQueryStringAndFragment('http://www.baidu.com/1?page=23#we'),
    cutURLParameter('http://www.baidu.com/1?page=1#erre&resv=23&name=user', 'resv')

Query id: e55b7a13-3e28-41c8-8079-3d5b86e57a61

┌─cutWWW('https://www.baidu.com')─┬─cutQueryString('http://www.baidu.com/1?page=1')─┬─cutFragment('http://www.baidu.com/#quick-demo')─┬─cutQueryStringAndFragment('http://www.baidu.com/1?page=23#we')─┬─cutURLParameter('http://www.baidu.com/1?page=1#erre&resv=23&name=user', 'resv')─┐
│ https://baidu.com               │ http://www.baidu.com/1                          │ http://www.baidu.com/                           │ http://www.baidu.com/1                                         │ http://www.baidu.com/1?page=1#erre&name=user                                    │
└─────────────────────────────────┴─────────────────────────────────────────────────┴─────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.003 sec.
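
In the cutURLParameter call above, the removed parameter sits after the "#". A sketch with parameters in an ordinary query string may be a more typical case. The URL and the parameter names are placeholders, and the third call asks for a name that is absent, which per the rule at the top of this section should leave the URL unchanged.

```sql
-- Removing the last parameter, the first parameter, and a parameter that does not exist.
SELECT
    cutURLParameter('http://example.com/list?page=1&sort=asc', 'sort') AS last_param_removed,
    cutURLParameter('http://example.com/list?page=1&sort=asc', 'page') AS first_param_removed,
    cutURLParameter('http://example.com/list?page=1&sort=asc', 'missing') AS nothing_to_remove;
```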

posted @ 2022-01-18 13:57 渐逝的星光
