[2022.06.20] Some useful Huginn agents
Huginn has so many agents that I need a dedicated post to record the ones I actually find useful.
Http Status Agent
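The HttpStatusAgent checks a url and emits the resulting HTTP status code along with the time it waited for a reply. Optionally, it also emits the value of one or more specified headers.
Specify a Url and the Http Status Agent will produce an event with the HTTP status code. If you also specify one or more Headers to save (comma-delimited), the values of those headers will be included in the event.
The disable redirect follow option causes the Agent not to follow HTTP redirects. For example, setting this to true will cause an Agent that receives a 301 redirect to http://yahoo.com to return a status of 301 instead of following the redirect and returning 200.
The changes only option causes the Agent to report an event only when the status changes. If set to false, an event will be created for every check. If set to true, an event will only be created when the status changes (for example, when your site goes from 200 to 500).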
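The effect of the changes only option can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not Huginn's implementation: the function name status_events is made up, and it assumes that the very first check always produces an event because there is no earlier status to compare with.

```python
def status_events(statuses, changes_only):
    """Return the status codes that would be reported as events."""
    events = []
    previous = None  # assumption: nothing is remembered before the first check
    for status in statuses:
        # With changes_only off, every check is reported.
        # With it on, a check is reported only when the status differs.
        if not changes_only or status != previous:
            events.append(status)
        previous = status
    return events


checks = [200, 200, 500, 500, 200]
print(status_events(checks, changes_only=False))
print(status_events(checks, changes_only=True))
```

With changes only enabled, the five checks above collapse to the three points where the status actually changed.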
Java Script Agent
The JavaScript Agent allows you to write code in JavaScript that can create and receive events. If other Agents aren’t meeting your needs, try this one!
You can put code in the code option, or put your code in a Credential and reference it from code with credential:<name> (recommended).
You can implement Agent.check and Agent.receive as you see fit. The following methods will be available on Agent in the JavaScript environment:
this.createEvent(payload)
this.incomingEvents() (the returned event objects will each have a payload property)
this.memory()
this.memory(key)
this.memory(keyToSet, valueToSet)
this.setMemory(object) (replaces the Agent’s memory with the provided object)
this.deleteKey(key) (deletes a key from memory and returns the value)
this.credential(name)
this.credential(name, valueToSet)
this.options()
this.options(key)
this.log(message)
this.error(message)
this.escapeHtml(htmlToEscape)
this.unescapeHtml(htmlToUnescape)
Jq Agent
The Jq Agent allows you to process incoming Events with jq the JSON processor. (This agent is not enabled on this server)
It allows you to filter, transform and restructure Events in the way you want using jq’s powerful features.
You can specify a jq filter expression to apply to each incoming event in filter, and the results it produces will become Events to be emitted.
You can optionally pass in variables to the filter program by specifying key-value pairs of a variable name and an associated value in the variables key, each of which becomes a predefined variable.
This Agent can be used to parse a complex JSON structure that is too hard to handle with JSONPath or Liquid templating.
For example, suppose that a Post Agent created an Event which contains a body key with a value of the JSON formatted string of the following response body:
{
"status": "1",
"since": "1245626956",
"list": {
"93817": {
"item_id": "93817",
"url": "http://url.com",
"title": "Page Title",
"time_updated": "1245626956",
"time_added": "1245626956",
"tags": "comma,seperated,list",
"state": "0"
},
"935812": {
"item_id": "935812",
"url": "http://google.com",
"title": "Google",
"time_updated": "1245635279",
"time_added": "1245635279",
"tags": "comma,seperated,list",
"state": "1"
}
}
}
Then you could have a Jq Agent with the following jq filter:
.body | fromjson | .list | to_entries | map(.value) | map(try(.tags |= split(",")) // .) | sort_by(.time_added | tonumber)
To get the following two Events emitted out of the said incoming Event from Post Agent:
[
{
"item_id": "93817",
"url": "http://url.com",
"title": "Page Title",
"time_updated": "1245626956",
"time_added": "1245626956",
"tags": ["comma", "seperated", "list"],
"state": "0"
},
{
"item_id": "935812",
"url": "http://google.com",
"title": "Google",
"time_updated": "1245635279",
"time_added": "1245635279",
"tags": ["comma", "seperated", "list"],
"state": "1"
}
]
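To make the jq filter above easier to follow, here is a rough Python equivalent of its steps: parse the body string, take the values under list, split tags into an array, and sort by time_added as a number. The function name jq_like_filter and the shortened sample data are made up for this sketch; the real agent runs jq itself.

```python
import json


def jq_like_filter(event):
    """Approximate: .body | fromjson | .list | to_entries | map(.value)
    | map(try(.tags |= split(",")) // .) | sort_by(.time_added | tonumber)"""
    items = list(json.loads(event["body"])["list"].values())
    for item in items:
        # Items without a string "tags" value are left unchanged,
        # which is what the try(...) // . part is for.
        if isinstance(item.get("tags"), str):
            item["tags"] = item["tags"].split(",")
    return sorted(items, key=lambda item: int(item["time_added"]))


response = {
    "status": "1",
    "list": {
        "935812": {"item_id": "935812", "time_updated": "1245635279",
                   "time_added": "1245635279", "tags": "comma,seperated,list"},
        "93817": {"item_id": "93817", "time_updated": "1245626956",
                  "time_added": "1245626956", "tags": "comma,seperated,list"},
    },
}
incoming_event = {"body": json.dumps(response)}
events = jq_like_filter(incoming_event)
for event in events:
    print(event["item_id"], event["time_added"], event["tags"])
```

The filter only restructures the data, so each emitted item keeps its own time_updated and time_added values.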
Json Parse Agent
The JSON Parse Agent parses a JSON string and emits the data in a new event or merges it with the original event.
data is the JSON to parse. Use Liquid templating to specify the JSON string.
data_key sets the key which contains the parsed JSON data in emitted events.
mode determines whether to create a new clean event or merge old payload with new values (default: clean).
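The difference between the clean and merge modes can be shown with a small Python sketch. The function name json_parse_event and the sample payload are invented for illustration, and the merge semantics (parsed data added on top of the original payload) are my reading of the description above rather than something checked against the agent's source.

```python
import json


def json_parse_event(incoming, data, data_key="data", mode="clean"):
    """Parse the JSON string `data` and place it under `data_key`."""
    parsed = {data_key: json.loads(data)}
    if mode == "merge":
        # Keep the original payload and add the parsed data to it.
        return {**incoming, **parsed}
    # "clean": the new event contains only the parsed data.
    return parsed


incoming = {"source": "sensor", "raw": '{"temp": 21.5}'}
clean_event = json_parse_event(incoming, incoming["raw"], "reading")
merged_event = json_parse_event(incoming, incoming["raw"], "reading", mode="merge")
print(clean_event)
print(merged_event)
```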
Manual Event Agent
The Manual Event Agent is used to manually create Events for testing or other purposes.
Connect this Agent to other Agents and create Events using the UI provided on this Agent’s Summary page.
You can set the default event payload via the “payload” option.
Mattermost Urls To Files
Takes a list of URLs, downloads them and then posts them as files to the described mattermost server, team, and channel. If message is defined the files are posted with the given message.
Phantom Js Cloud Agent
This Agent generates PhantomJs Cloud URLs that can be used to render JavaScript-heavy webpages for content extraction.
URLs generated by this Agent are formulated in accordance with the PhantomJs Cloud API. The generated URLs can then be supplied to a Website Agent to fetch and parse the content.
Sign up to get an api key, and add it in Huginn credentials.
Please see the Huginn Wiki for more info.
Options:
Api key - PhantomJs Cloud API Key credential stored in Huginn
Url - The url to render
Mode - Create a new clean event or merge old payload with new values (default: clean)
Render type - Render as html, plain text without html tags, or jpg as screenshot of the page (default: html)
Output as json - Return the page contents and metadata as a JSON object (default: false)
Ignore images - Skip loading of inlined images (default: false)
Url agent - A custom User-Agent name (default: Huginn - https://github.com/huginn/huginn)
Wait interval - Milliseconds to delay rendering after the last resource is finished loading. This is useful in case there are any AJAX requests or animations that need to finish up. This can safely be set to 0 if you know there are no AJAX or animations you need to wait for (default: 1000 ms)
As this agent only provides a limited subset of the most commonly used options, you can follow this guide to make full use of additional options PhantomJsCloud provides.
Port Status Agent
The agent checks a port (TCP) for a specific host.
expected_receive_period_in_days is used to determine if the Agent is working. Set it to the maximum number of days that you anticipate passing without this Agent receiving an incoming Event.
Post Agent
A Post Agent receives events from other agents (or runs periodically), merges those events with the Liquid-interpolated contents of payload, and sends the results as POST (or GET) requests to a specified url. To skip merging in the incoming event, but still send the interpolated payload, set no_merge to true.
The post_url field must specify where you would like to send requests. Please include the URI scheme (http or https).
The method used can be any of get, post, put, patch, and delete.
By default, non-GETs will be sent with form encoding (application/x-www-form-urlencoded).
Change content_type to json to send JSON instead.
Change content_type to xml to send XML, where the name of the root element may be specified using xml_root, defaulting to post.
When content_type contains a MIME type, and payload is a string, its interpolated value will be sent as a string in the HTTP request’s body and the request’s Content-Type HTTP header will be set to content_type. When payload is a string no_merge has to be set to true.
If emit_events is set to true, the server response will be emitted as an Event and can be fed to a WebsiteAgent for parsing (using its data_from_event and type options). No data processing will be attempted by this Agent, so the Event’s “body” value will always be raw text. The Event will also have a “headers” hash and a “status” integer value.
If output_mode is set to merge, the emitted Event will be merged into the original contents of the received Event.
Set event_headers to a list of header names, either in an array of strings or in a comma-separated string, to include only some of the header values.
Set event_headers_style to one of the following values to normalize the keys of “headers” for downstream agents’ convenience:
capitalized (default) - Header names are capitalized; e.g. “Content-Type”
downcased - Header names are downcased; e.g. “content-type”
snakecased - Header names are snakecased; e.g. “content_type”
raw - Backward compatibility option to leave them unmodified from what the underlying HTTP library returns.
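As a quick illustration of the four event_headers_style values, the following Python sketch normalizes a header name in each style. The function normalize_header is hypothetical and only mirrors the examples given in the list above; Huginn itself is written in Ruby and may treat edge cases differently.

```python
def normalize_header(name, style="capitalized"):
    """Return a header name rewritten in the given style."""
    if style == "capitalized":
        # content-type -> Content-Type
        return "-".join(part.capitalize() for part in name.split("-"))
    if style == "downcased":
        # Content-Type -> content-type
        return name.lower()
    if style == "snakecased":
        # Content-Type -> content_type
        return name.lower().replace("-", "_")
    if style == "raw":
        # Leave whatever the HTTP library returned.
        return name
    raise ValueError("unknown style: " + style)


for style in ("capitalized", "downcased", "snakecased", "raw"):
    print(style, normalize_header("content-Type", style))
```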
Other Options:
headers - When present, it should be a hash of headers to send with the request.
basic_auth - Specify HTTP basic auth parameters: "username:password", or ["username", "password"].
disable_ssl_verification - Set to true to disable ssl verification.
user_agent - A custom User-Agent name (default: “Faraday v0.12.1”).
This agent can consume a ‘file pointer’ event from the following agents with no additional configuration: FtpsiteAgent, S3Agent, LocalFileAgent. Read more about the concept in the wiki.
When receiving a file_pointer the request will be sent with multipart encoding (multipart/form-data) and content_type is ignored. upload_key can be used to specify the parameter in which the file will be sent, it defaults to file.
Raw Webhook Agent
The Raw Webhook Agent will create events by receiving webhooks from any source. In order to create events with this agent, make a POST request to:
https:///users/1/web_requests/:id/:secret
The placeholder symbols above will be replaced by their values once the agent is saved.
Options:
secret (required) - A token that the host will provide for authentication.
expected_receive_period_in_days (required) - How often you expect to receive events this way. Used to determine if the agent is working.
verbs - Comma-separated list of http verbs your agent will accept. For example, “post,get” will enable POST and GET requests. Defaults to “post”.
response - The response message to the request. Defaults to ‘Event Created’.
response_headers - An object with any custom response headers. (example: {"Access-Control-Allow-Origin": "*"})
code - The response code to the request. Defaults to ‘201’. If the code is ‘301’ or ‘302’ the request will automatically be redirected to the url defined in “response”.
recaptcha_secret - Setting this to a reCAPTCHA “secret” key makes your agent verify incoming requests with reCAPTCHA. Don’t forget to embed a reCAPTCHA snippet including your “site” key in the originating form(s).
recaptcha_send_remote_addr - Set this to true if your server is properly configured to set REMOTE_ADDR to the IP address of each visitor (instead of that of a proxy server).
force_encoding - Set this to override the automatic detection of request encoding. (example: UTF-8)
content_type - Override the value of Content-Type header in the response. (example: application/json, default: text/plain)
Read File Agent
The ReadFileAgent takes events from FileHandling agents, reads the file, and emits the contents as a string.
data_key specifies the key of the emitted event which contains the file contents.
This agent can consume a ‘file pointer’ event from the following agents with no additional configuration: FtpsiteAgent, S3Agent, LocalFileAgent. Read more about the concept in the wiki.
Rss Agent
The RSS Agent consumes RSS feeds and emits events when they change.
This agent, using Feedjira as a base, can parse various types of RSS and Atom feeds and has some special handlers for FeedBurner, iTunes RSS, and so on. However, supported fields are limited by its general and abstract nature. For complex feeds with additional field types, we recommend using a WebsiteAgent. See this example.
If you want to output an RSS feed, use the DataOutputAgent.
Options:
url - The URL of the RSS feed (an array of URLs can also be used; items with identical guids across feeds will be considered duplicates).
include_feed_info - Set to true to include feed information in each event.
clean - Set to true to sanitize description and content as HTML fragments, removing unknown/unsafe elements and attributes.
expected_update_period_in_days - How often you expect this RSS feed to change. If more than this amount of time passes without an update, the Agent will mark itself as not working.
headers - When present, it should be a hash of headers to send with the request.
basic_auth - Specify HTTP basic auth parameters: "username:password", or ["username", "password"].
disable_ssl_verification - Set to true to disable ssl verification.
disable_url_encoding - Set to true to disable url encoding.
force_encoding - Set force_encoding to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
user_agent - A custom User-Agent name (default: “Faraday v0.12.1”).
max_events_per_run - Limit number of events created (items parsed) per run for feed.
remembered_id_count - Number of IDs to keep track of and avoid re-emitting (default: 500).
Ordering Events
To specify the order of events created in each run, set events_order to an array of sort keys, each of which looks like either expression or [expression, type, descending], as described as follows:
- expression is a Liquid template to generate a string to be used as sort key.
- type (optional) is one of string (default), number and time, which specifies how to evaluate expression for comparison.
- descending (optional) is a boolean value to determine if comparison should be done in descending (reverse) order, which defaults to false.
Sort keys listed earlier take precedence over ones listed later. For example, if you want to sort articles by the date and then by the author, specify [["{{date}}", "time"], "{{author}}"].
Sorting is done stably, so even if all events have the same set of sort key values the original order is retained. Also, a special Liquid variable _index_ is provided, which contains the zero-based index number of each event, which means you can exactly reverse the order of events by specifying [["{{_index_}}", "number", true]].
If the include_sort_info option is set, each created event will have a sort_info key whose value is a hash containing the following keys:
position: 1-based index of each event after the sort
count: Total number of events sorted
In this Agent, the default value for events_order is [["{{date_published}}","time"],["{{last_updated}}","time"]].
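The events_order rules above (earlier keys take precedence, optional type and descending flag, stable sorting) can be sketched in Python as follows. To keep the example short it uses plain field names where Huginn uses Liquid expressions, and ISO date strings for the time type; both are simplifications of mine, and order_events is a made-up name.

```python
from datetime import datetime

# How each type converts a field value before comparison.
CONVERTERS = {
    "string": str,
    "number": float,
    "time": datetime.fromisoformat,
}


def order_events(events, events_order):
    """Sort events by a list of keys: "field" or [field, type, descending]."""
    normalized = []
    for key in events_order:
        if isinstance(key, str):
            key = [key]
        field = key[0]
        kind = key[1] if len(key) > 1 else "string"
        descending = key[2] if len(key) > 2 else False
        normalized.append((field, kind, descending))
    ordered = list(events)
    # Python's sort is stable, so sorting by the least significant key
    # first lets earlier keys take precedence over later ones.
    for field, kind, descending in reversed(normalized):
        convert = CONVERTERS[kind]
        ordered.sort(key=lambda event: convert(event[field]), reverse=descending)
    return ordered


articles = [
    {"title": "C", "date": "2022-06-20", "author": "bob"},
    {"title": "A", "date": "2022-06-19", "author": "zed"},
    {"title": "B", "date": "2022-06-20", "author": "amy"},
]
by_date_then_author = order_events(articles, [["date", "time"], "author"])
print([article["title"] for article in by_date_then_author])
```

Sorting by date and then author puts the two articles from the same day in author order, while articles with identical keys would keep their original relative order.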
Scheduler Agent
The Scheduler Agent periodically takes an action on target Agents according to a user-defined schedule.
Action types
Set action to one of the action types below:
- run: Target Agents are run at intervals, except for those disabled.
- disable: Target Agents are disabled (if not) at intervals.
- enable: Target Agents are enabled (if not) at intervals.
  - If the option drop_pending_events is set to true, pending events will be cleared before the agent is enabled.
Targets
Select Agents that you want to run periodically by this SchedulerAgent.
Schedule
Set schedule to a schedule specification in the cron format. For example:
- 0 22 * * 1-5: every weekday (Monday through Friday) at 22:00 (10pm)
- */10 8-11 * * *: every 10 minutes from 8:00 to and not including 12:00
This variant has several extensions as explained below.
Timezones
You can optionally specify a timezone (default: Pacific Time (US & Canada)) after the day-of-week field using the labels in the tz database
- 0 22 * * 1-5 Europe/Paris: every weekday (Monday through Friday) when it’s 22:00 in Paris
- 0 22 * * 1-5 Etc/GMT+2: every weekday (Monday through Friday) when it’s 22:00 in GMT+2
Seconds
You can optionally specify seconds before the minute field.
*/30 * * * * *: every 30 seconds
Only multiples of fifteen are allowed as values for the seconds field, i.e. */15, */30, 15,45 etc.
Last day of month
L signifies “last day of month” in day-of-month.
0 22 L * *: every month on the last day at 22:00
Weekday names
You can use three letter names instead of numbers in the weekdays field.
0 22 * * Sat,Sun: every Saturday and Sunday, at 22:00
Nth weekday of the month
You can specify “nth weekday of the month” like this.
- 0 22 * * Sun#1,Sun#2: every first and second Sunday of the month, at 22:00
- 0 22 * * Sun#L1: every last Sunday of the month, at 22:00
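To double-check how the basic cron fields in the schedule examples expand, here is a small Python helper. It only understands the plain forms used in those examples (a number, a range such as 8-11, a step such as */10, and comma lists), not the timezone, L, weekday-name or # extensions, and expand_field is a name I made up; the Scheduler Agent does its own parsing.

```python
def expand_field(field, low, high):
    """Expand one cron field into the sorted list of values it matches."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return sorted(values)


# "*/10 8-11 * * *": minute and hour fields
print(expand_field("*/10", 0, 59))
print(expand_field("8-11", 0, 23))
# "0 22 * * 1-5": day-of-week field, where 1 is Monday and 5 is Friday
print(expand_field("1-5", 0, 6))
```

The hour field 8-11 covers 8, 9, 10 and 11, which is why the second schedule example stops just before 12:00.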
Trigger Agent
The Trigger Agent will watch for a specific value in an Event payload.
The rules array contains a mixture of strings and hashes.
A string rule is a Liquid template and counts as a match when it expands to true.
A hash rule consists of the following keys: path, value, and type.
The path value is a dotted path through a hash in JSONPaths syntax. For simple events, this is usually just the name of the field you want, like ‘text’ for the text key of the event.
The type can be one of regex, !regex, field<value, field<=value, field==value, field!=value, field>=value, field>value, and not in and compares with the value. Note that regex patterns are matched case insensitively. If you want case sensitive matching, prefix your pattern with (?-i).
In any type including regex Liquid variables can be used normally. To search for just a word matching the concatenation of foo and variable bar would use value of foo{{bar}}. Note that starting/ending delimiters like / or | are not required for regex.
The value can be a single value or an array of values. In the case of an array, all items must be strings, and if one or more values match, then the rule matches. Note: avoid using field!=value with arrays, you should use not in instead.
By default, all rules must match for the Agent to trigger. You can switch this so that only one rule must match by setting must_match to 1.
The resulting Event will have a payload message of message. You can use liquid templating in the message, have a look at the Wiki for details.
Set keep_event to true if you’d like to re-emit the incoming event, optionally merged with ‘message’ when provided.
Set expected_receive_period_in_days to the maximum amount of time that you’d expect to pass between Events being received by this Agent.
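The hash rules and the must_match switch described above can be approximated in Python like this. It is a simplified sketch under several assumptions of mine: path is treated as a plain top-level key rather than full JSONPath, Liquid interpolation and string rules are left out, numeric comparisons convert both sides to floats, and the names rule_matches and triggers are invented. The real agent may differ in its edge cases.

```python
import operator
import re

COMPARATORS = {
    "field<value": operator.lt,
    "field<=value": operator.le,
    "field>=value": operator.ge,
    "field>value": operator.gt,
}


def rule_matches(rule, payload):
    """Evaluate one hash rule against an event payload."""
    field = payload.get(rule["path"])
    value = rule["value"]
    values = value if isinstance(value, list) else [value]
    kind = rule["type"]
    if kind == "regex":
        # Case-insensitive by default, as described above.
        return any(re.search(v, str(field), re.IGNORECASE) for v in values)
    if kind == "!regex":
        return not any(re.search(v, str(field), re.IGNORECASE) for v in values)
    if kind == "field==value":
        return any(str(field) == str(v) for v in values)
    if kind == "field!=value":
        return any(str(field) != str(v) for v in values)
    if kind == "not in":
        return all(str(field) != str(v) for v in values)
    compare = COMPARATORS[kind]
    return any(compare(float(field), float(v)) for v in values)


def triggers(rules, payload, must_match=None):
    """All rules must match unless must_match lowers the required count."""
    matched = sum(1 for rule in rules if rule_matches(rule, payload))
    needed = len(rules) if must_match is None else must_match
    return matched >= needed


payload = {"text": "Server DOWN", "load": "7.5"}
rules = [
    {"type": "regex", "value": "down", "path": "text"},
    {"type": "field>value", "value": "10", "path": "load"},
]
print(triggers(rules, payload))
print(triggers(rules, payload, must_match=1))
```

Here only the regex rule matches, so the payload triggers only when must_match is set to 1. The sketch also shows why not in is the safer choice for arrays: field!=value is true as soon as any one array item differs from the field.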
Webhook Agent
The Webhook Agent will create events by receiving webhooks from any source. In order to create events with this agent, make a POST request to:
https:///users/1/web_requests/:id/:secret
The placeholder symbols above will be replaced by their values once the agent is saved.
Options:
secret - A token that the host will provide for authentication.
expected_receive_period_in_days - How often you expect to receive events this way. Used to determine if the agent is working.
payload_path - JSONPath of the attribute in the POST body to be used as the Event payload. Set to . to return the entire message. If payload_path points to an array, Events will be created for each element.
event_headers - Comma-separated list of HTTP headers your agent will include in the payload.
event_headers_key - The key to use to store all the headers received
verbs - Comma-separated list of http verbs your agent will accept. For example, “post,get” will enable POST and GET requests. Defaults to “post”.
response - The response message to the request. Defaults to ‘Event Created’.
response_headers - An object with any custom response headers. (example: {"Access-Control-Allow-Origin": "*"})
code - The response code to the request. Defaults to ‘201’. If the code is ‘301’ or ‘302’ the request will automatically be redirected to the url defined in “response”.
recaptcha_secret - Setting this to a reCAPTCHA “secret” key makes your agent verify incoming requests with reCAPTCHA. Don’t forget to embed a reCAPTCHA snippet including your “site” key in the originating form(s).
recaptcha_send_remote_addr - Set this to true if your server is properly configured to set REMOTE_ADDR to the IP address of each visitor (instead of that of a proxy server).
score_threshold - Set this when using reCAPTCHA v3 to define the threshold when a submission is verified. Defaults to 0.5
Website Agent
The Website Agent scrapes a website, XML document, or JSON feed and creates Events based on the results.
Specify a url and select a mode for when to create Events based on the scraped data, either all, on_change, or merge (if fetching based on an Event, see below).
The url option can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape).
The WebsiteAgent can also scrape based on incoming events.
- Set the url_from_event option to a Liquid template to generate the url to access based on the Event. (To fetch the url in the Event’s url key, for example, set url_from_event to {{ url }}.)
- Alternatively, set data_from_event to a Liquid template to use data directly without fetching any URL. (For example, set it to {{ html }} to use HTML contained in the html key of the incoming Event.)
- If you specify merge for the mode option, Huginn will retain the old payload and update it with new values.
Supported Document Types
The type value can be xml, html, json, or text.
To tell the Agent how to parse the content, specify extract as a hash with keys naming the extractions and values of hashes.
Note that for all of the formats, whatever you extract MUST have the same number of matches for each extractor except when it has repeat set to true. E.g., if you’re extracting rows, all extractors must match all rows. For generating CSS selectors, something like SelectorGadget may be helpful.
For extractors with hidden set to true, they will be excluded from the payloads of events created by the Agent, but can be used and interpolated in the template option explained below.
For extractors with repeat set to true, their first matches will be included in all extracts. This is useful such as when you want to include the title of a page in all events created from the page.
Scraping HTML and XML
When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in css or an XPath expression in xpath. It then evaluates an XPath expression in value (default: .) on each node in the node set, converting the result into a string. Here’s an example:
"extract": {
"url": { "css": "#comic img", "value": "@src" },
"title": { "css": "#comic img", "value": "@title" },
"body_text": { "css": "div.main", "value": "string(.)" },
"page_title": { "css": "title", "value": "string(.)", "repeat": true }
}
or
"extract": {
"url": { "xpath": "//*[@class=\"blog-item\"]/a/@href", "value": "." },
"title": { "xpath": "//*[@class=\"blog-item\"]/a", "value": "normalize-space(.)" },
"description": { "xpath": "//*[@class=\"blog-item\"]/div[1]", "value": "string(.)" }
}
“@attr” is the XPath expression to extract the value of an attribute named attr from a node (such as “@href” from a hyperlink), and string(.) gives a string with all the enclosed text nodes concatenated without entity escaping (such as &amp;). To extract the innerHTML, use ./node(); and to extract the outer HTML, use . (a single dot).
You can also use XPath functions like normalize-space to strip and squeeze whitespace, substring-after to extract part of a text, and translate to remove commas from formatted numbers, etc. Instead of passing string(.) to these functions, you can just pass . like normalize-space(.) and translate(., ',', '').
Beware that when parsing an XML document (i.e. type is xml) using xpath expressions, all namespaces are stripped from the document unless the top-level option use_namespaces is set to true.
For extraction with array set to true, all matches will be extracted into an array. This is useful when extracting list elements or multiple parts of a website that can only be matched with the same selector.
Scraping JSON
When parsing JSON, these sub-hashes specify JSONPaths to the values that you care about.
Sample incoming event:
{ "results": {
"data": [
{
"title": "Lorem ipsum 1",
"description": "Aliquam pharetra leo ipsum.",
"price": 8.95
},
{
"title": "Lorem ipsum 2",
"description": "Suspendisse a pulvinar lacus.",
"price": 12.99
},
{
"title": "Lorem ipsum 3",
"description": "Praesent ac arcu tellus.",
"price": 8.99
}
]
}
}
Sample rule:
"extract": {
"title": { "path": "results.data[*].title" },
"description": { "path": "results.data[*].description" }
}
In this example the * wildcard character makes the parser iterate through all items of the data array. Three events will be created as a result.
Sample outgoing events:
[
{
"title": "Lorem ipsum 1",
"description": "Aliquam pharetra leo ipsum."
},
{
"title": "Lorem ipsum 2",
"description": "Suspendisse a pulvinar lacus."
},
{
"title": "Lorem ipsum 3",
"description": "Praesent ac arcu tellus."
}
]
The extract option can be skipped for the JSON type, causing the full JSON response to be returned.
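The way the sample rule turns one JSON document into three events can be sketched in Python. This toy extract function only understands dotted paths with a [*] wildcard, like the ones in the sample rule; it is not a JSONPath implementation and not what the Website Agent runs internally, just an illustration of why every extractor has to yield the same number of matches.

```python
def extract(document, rules):
    """Apply {"name": {"path": "a.b[*].c"}} rules and zip matches into events."""
    columns = {}
    for name, rule in rules.items():
        values = [document]
        for step in rule["path"].split("."):
            wildcard = step.endswith("[*]")
            key = step[:-3] if wildcard else step
            values = [value[key] for value in values]
            if wildcard:
                # Flatten the array so later steps apply to every item.
                values = [item for value in values for item in value]
        columns[name] = values
    names = list(columns)
    # One event per row: the n-th match of every extractor goes together.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


document = {"results": {"data": [
    {"title": "Lorem ipsum 1", "description": "Aliquam pharetra leo ipsum.", "price": 8.95},
    {"title": "Lorem ipsum 2", "description": "Suspendisse a pulvinar lacus.", "price": 12.99},
    {"title": "Lorem ipsum 3", "description": "Praesent ac arcu tellus.", "price": 8.99},
]}}
rules = {
    "title": {"path": "results.data[*].title"},
    "description": {"path": "results.data[*].description"},
}
events = extract(document, rules)
for event in events:
    print(event)
```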
Scraping Text
When parsing text, each sub-hash should contain a regexp and index. Output text is matched against the regular expression repeatedly from the beginning through to the end, collecting a captured group specified by index in each match. Each index should be either an integer or a string name which corresponds to (?<name>...). For example, to parse lines of word: definition, the following should work:
"extract": {
"word": { "regexp": "^(.+?): (.+)$", "index": 1 },
"definition": { "regexp": "^(.+?): (.+)$", "index": 2 }
}
Or if you prefer names to numbers for index:
或者如果你喜欢名字而不喜欢数字作为索引:
"extract": {
"word": { "regexp": "^(?<word>.+?): (?<definition>.+)$", "index": "word" },
"definition": { "regexp": "^(?<word>.+?): (?<definition>.+)$", "index": "definition" }
}
To extract the whole content as one event:
将整个内容作为一个事件提取:
"extract": {
"content": { "regexp": "\A(?m:.)*\z", "index": 0 }
}
Beware that .
does not match the newline character (LF) unless the m
flag is in effect, and ^
/$
basically match every line beginning/end. See this document to learn the regular expression variant used in this service.
小心点。不匹配换行符(LF) ,除非 m 标志有效,并且 ^/$基本上匹配每一行开始/结束。请参阅此文档以了解此服务中使用的正则表达式变量。
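To get a feel for how repeated matching with an `index` behaves, here is a rough Python imitation. It is a sketch under assumptions, not what Huginn runs: Huginn is a Ruby application and Python's `re` module differs in detail. Named groups are written `(?P<name>...)` in Python, `re.MULTILINE` is needed to make `^`/`$` match at every line, and `re.DOTALL` plays the role of the `m` flag described above. The `extract` helper and the sample text are made up for illustration.

```python
import re

TEXT = "apple: a fruit\nbook: a set of pages\n"

# re.MULTILINE makes ^ and $ match at the start and end of every line.
PATTERN = re.compile(r"^(?P<word>.+?): (?P<definition>.+)$", re.MULTILINE)


def extract(text, pattern, index):
    """Hypothetical helper: collect group `index` (number or name) from every match."""
    return [match.group(index) for match in pattern.finditer(text)]


words = extract(TEXT, PATTERN, 1)                   # numeric index
definitions = extract(TEXT, PATTERN, "definition")  # named index

# Whole content as a single match: DOTALL lets "." match LF as well.
whole = extract(TEXT, re.compile(r"\A.*\Z", re.DOTALL), 0)

print(words, definitions, len(whole))
```

Each extractor yields one value per matching line, which is why both extractors have to match the same number of times.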
General Options
Can be configured to use HTTP basic auth by including the `basic_auth` parameter with `"username:password"`, or `["username", "password"]`.
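The two accepted forms carry the same credentials. The sketch below shows how either one maps to a standard HTTP Basic `Authorization` header; `basic_auth_header` is a hypothetical helper written for this note, and how Huginn builds the header internally is not something the documentation states.

```python
import base64


def basic_auth_header(basic_auth):
    """Hypothetical helper: accept "user:pass" or ["user", "pass"]."""
    if isinstance(basic_auth, str):
        # Split on the first colon only, so the password may contain colons.
        username, _, password = basic_auth.partition(":")
    else:
        username, password = basic_auth
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


from_string = basic_auth_header("username:password")
from_pair = basic_auth_header(["username", "password"])
print(from_string == from_pair)
```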
Set `expected_update_period_in_days` to the maximum amount of time that you’d expect to pass between Events being created by this Agent. This is only used to set the “working” status.
Set `uniqueness_look_back` to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of 200 or 3x the number of detected received results.
Set `force_encoding` to an encoding name (such as `UTF-8` or `ISO-8859-1`) if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Below are the steps used by Huginn to detect the encoding of fetched content:
- If `force_encoding` is given, that value is used.
- If the Content-Type header contains a charset parameter, that value is used.
- When `type` is `html` or `xml`, Huginn checks for the presence of a BOM, an XML declaration with an “encoding” attribute, or an HTML meta tag with charset information, and uses that if found.
- Huginn falls back to UTF-8 (not ISO-8859-1).
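The order of these checks is the important part, so here is a simplified Python sketch of the same fallback chain. `detect_encoding` is a hypothetical function, not Huginn's code: it only recognises the UTF-8 BOM, and its regular expressions are rough approximations of real header and markup parsing.

```python
import re


def detect_encoding(body, content_type=None, doc_type="html", force_encoding=None):
    """Hypothetical helper that walks the four documented steps in order."""
    # 1. An explicit force_encoding always wins.
    if force_encoding:
        return force_encoding
    # 2. charset parameter of the Content-Type header.
    if content_type:
        match = re.search(r'charset="?([\w.:-]+)', content_type, re.IGNORECASE)
        if match:
            return match.group(1)
    # 3. For html/xml, look inside the document itself.
    if doc_type in ("html", "xml"):
        if body.startswith(b"\xef\xbb\xbf"):  # UTF-8 byte order mark only
            return "UTF-8"
        head = body[:1024].decode("ascii", errors="ignore")
        match = re.search(r'<\?xml[^>]*encoding=["\']([\w.:-]+)', head)
        if match:
            return match.group(1)
        match = re.search(r'<meta[^>]*charset=["\']?([\w.:-]+)', head, re.IGNORECASE)
        if match:
            return match.group(1)
    # 4. Fall back to UTF-8.
    return "UTF-8"


print(detect_encoding(b"<html></html>", content_type="text/html; charset=Shift_JIS"))
```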
Set `user_agent` to a custom User-Agent name if the website does not like the default value (`Huginn - https://github.com/huginn/huginn`).
The `headers` field is optional. When present, it should be a hash of headers to send with the request.
Set `disable_ssl_verification` to `true` to disable SSL verification.
Set `unzip` to `gzip` to inflate the resource using gzip.
Set `http_success_codes` to an array of status codes (e.g., `[404, 422]`) to treat HTTP response codes beyond 200 as successes.
If a `template` option is given, its value must be a hash, whose key-value pairs are interpolated after extraction for each iteration and merged with the payload. In the template, keys of extracted data can be interpolated, and some additional variables are also available as explained in the next section. For example:
"template": {
  "url": "{{ url | to_uri: _response_.url }}",
  "description": "{{ body_text }}",
  "last_modified": "{{ _response_.headers.Last-Modified | date: '%FT%T' }}"
}
In the `on_change` mode, change is detected based on the resulting event payload after applying this option. If you want to add some keys to each event but ignore any change in them, set `mode` to `all` and put a DeDuplicationAgent downstream.
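The `to_uri` filter in the template above turns an extracted link into an absolute URL relative to the page it came from. As a rough stand-in, Python's `urljoin` does the same kind of resolution. The `to_uri` function and the example.com URLs below are illustrative assumptions, and the real filter may differ in edge cases.

```python
from urllib.parse import urljoin


def to_uri(link, base):
    """Rough stand-in for the `to_uri` Liquid filter: resolve `link` against `base`."""
    return urljoin(base, link)


base = "https://example.com/comics/today.html"  # plays the role of _response_.url
absolute = to_uri("/images/strip.png", base)
relative = to_uri("archive/1.html", base)
untouched = to_uri("https://cdn.example.com/a.png", base)
print(absolute, relative, untouched)
```

A link that is already absolute passes through unchanged, so the filter is safe to apply to every extracted URL.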
Liquid Templating
In Liquid templating, the following variables are available:
- `_url_`: The URL specified to fetch the content from. When parsing `data_from_event`, this is not set.
- `_response_`: A response object with the following keys:
  - `status`: HTTP status as integer. (Almost always 200) When parsing `data_from_event`, this is set to the value of the `status` key in the incoming Event, if it is a number or a string convertible to an integer.
  - `headers`: Response headers; for example, `{{ _response_.headers.Content-Type }}` expands to the value of the Content-Type header. Keys are insensitive to case and to the difference between `-` and `_`. When parsing `data_from_event`, this is constructed from the value of the `headers` key in the incoming Event, if it is a hash.
  - `url`: The final URL of the fetched page, following redirects. When parsing `data_from_event`, this is set to the value of the `url` key in the incoming Event. Using this in the `template` option, you can resolve relative URLs extracted from a document like `{{ link | to_uri: _response_.url }}` and `{{ content | rebase_hrefs: _response_.url }}`.
Ordering Events
To specify the order of events created in each run, set `events_order` to an array of sort keys, each of which looks like either `expression` or `[expression, type, descending]`, as described below:
- expression is a Liquid template to generate a string to be used as sort key.
- type (optional) is one of `string` (default), `number` and `time`, which specifies how to evaluate expression for comparison.
- descending (optional) is a boolean value to determine if comparison should be done in descending (reverse) order, which defaults to `false`.
Sort keys listed earlier take precedence over ones listed later. For example, if you want to sort articles by the date and then by the author, specify `[["{{date}}", "time"], "{{author}}"]`.
Sorting is done stably, so even if all events have the same set of sort key values the original order is retained. Also, a special Liquid variable `_index_` is provided, which contains the zero-based index number of each event, which means you can exactly reverse the order of events by specifying `[["{{_index_}}", "number", true]]`.
If the `include_sort_info` option is set, each created event will have a `sort_info` key whose value is a hash containing the following keys:
- `position`: 1-based index of each event after the sort
- `count`: Total number of events sorted
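The interaction of key precedence, stable sorting, `_index_` and `sort_info` is easier to see in code. The following Python sketch imitates the described behaviour under simplifying assumptions: `order_events` is a hypothetical helper, each sort key is a `(field, type, descending)` tuple with a plain field name standing in for the Liquid expression, and times are ISO 8601 strings.

```python
from datetime import datetime

# How each documented type is evaluated for comparison in this sketch.
CASTS = {"string": str, "number": float, "time": datetime.fromisoformat}


def order_events(events, events_order, include_sort_info=False):
    """Hypothetical helper imitating events_order and include_sort_info."""
    # Attach the zero-based _index_ so it can be used as a sort key.
    indexed = [dict(event, _index_=i) for i, event in enumerate(events)]
    # Sort by the least significant key first; list.sort is stable, so keys
    # listed earlier in events_order end up taking precedence.
    for field, kind, descending in reversed(events_order):
        cast = CASTS[kind]
        indexed.sort(key=lambda e: cast(e[field]), reverse=descending)
    result = []
    for position, event in enumerate(indexed, start=1):
        payload = {k: v for k, v in event.items() if k != "_index_"}
        if include_sort_info:
            payload["sort_info"] = {"position": position, "count": len(indexed)}
        result.append(payload)
    return result


articles = [
    {"date": "2022-06-20", "author": "bob"},
    {"date": "2022-06-18", "author": "carol"},
    {"date": "2022-06-20", "author": "alice"},
]
by_date_then_author = order_events(
    articles, [("date", "time", False), ("author", "string", False)], include_sort_info=True
)
reversed_order = order_events(articles, [("_index_", "number", True)])
print([a["author"] for a in by_date_then_author], [a["author"] for a in reversed_order])
```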
The Website Agent scrapes a website, XML document, or JSON feed and creates Events based on the results.
Specify a `url` and select a `mode` for when to create Events based on the scraped data, either `all`, `on_change`, or `merge` (if fetching based on an Event, see below).
The `url` option can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape).
The WebsiteAgent can also scrape based on incoming events.
- Set the `url_from_event` option to a Liquid template to generate the url to access based on the Event. (To fetch the url in the Event’s `url` key, for example, set `url_from_event` to `{{ url }}`.)
- Alternatively, set `data_from_event` to a Liquid template to use data directly without fetching any URL. (For example, set it to `{{ html }}` to use HTML contained in the `html` key of the incoming Event.)
- If you specify `merge` for the `mode` option, Huginn will retain the old payload and update it with new values.
Supported Document Types
The `type` value can be `xml`, `html`, `json`, or `text`.
To tell the Agent how to parse the content, specify `extract` as a hash with keys naming the extractions and values of hashes.
Note that for all of the formats, whatever you extract MUST have the same number of matches for each extractor except when it has `repeat` set to true. E.g., if you’re extracting rows, all extractors must match all rows. For generating CSS selectors, something like SelectorGadget may be helpful.
Extractors with `hidden` set to true are excluded from the payloads of events created by the Agent, but can be used and interpolated in the `template` option explained below.
For extractors with `repeat` set to true, their first matches are included in all extracts. This is useful, for example, when you want to include the title of a page in all events created from the page.
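The "same number of matches" rule and the `repeat` and `hidden` flags can be pictured as zipping parallel lists together. The Python sketch below is an illustration only: `build_events` is a hypothetical helper, and here hidden values are simply dropped, whereas in Huginn they remain available to the `template` option.

```python
def build_events(matches, repeat=(), hidden=()):
    """Hypothetical helper: `matches` maps extractor name -> list of matched strings."""
    regular = {name: values for name, values in matches.items() if name not in repeat}
    counts = {len(values) for values in regular.values()}
    if len(counts) > 1:
        raise ValueError("extractors matched different numbers of items")
    total = counts.pop() if counts else 1
    events = []
    for i in range(total):
        event = {}
        for name, values in matches.items():
            if name in hidden:
                continue  # kept out of the payload
            # A repeat extractor contributes its first match to every event.
            event[name] = values[0] if name in repeat else values[i]
        events.append(event)
    return events


matches = {
    "title": ["First post", "Second post"],
    "url": ["/posts/1", "/posts/2"],
    "page_title": ["My blog"],
}
events = build_events(matches, repeat=("page_title",))
print(events)
```

If `title` matched twice but `url` only once, the helper raises an error, which mirrors why every non-repeating extractor has to match every row.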
Scraping HTML and XML
When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into a string. Here’s an example:
"extract": {
  "url": { "css": "#comic img", "value": "@src" },
  "title": { "css": "#comic img", "value": "@title" },
  "body_text": { "css": "div.main", "value": "string(.)" },
  "page_title": { "css": "title", "value": "string(.)", "repeat": true }
}
or
"extract": {
  "url": { "xpath": "//*[@class=\"blog-item\"]/a/@href", "value": "." },
  "title": { "xpath": "//*[@class=\"blog-item\"]/a", "value": "normalize-space(.)" },
  "description": { "xpath": "//*[@class=\"blog-item\"]/div[1]", "value": "string(.)" }
}
“@attr” is the XPath expression to extract the value of an attribute named attr from a node (such as “@href” from a hyperlink), and `string(.)` gives a string with all the enclosed text nodes concatenated without entity escaping (such as `&amp;`). To extract the innerHTML, use `./node()`; and to extract the outer HTML, use `.`.
You can also use XPath functions like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove commas from formatted numbers, etc. Instead of passing `string(.)` to these functions, you can just pass `.`, like `normalize-space(.)` and `translate(., ',', '')`.
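For readers less familiar with XPath, the effect of these functions can be approximated with plain Python string operations. This is a sketch for intuition only: the HTML fragment is made up, the standard library's ElementTree does not evaluate these XPath functions itself, and Python's `split()` treats a few more characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

FRAGMENT = "<div class='price'>  Total:\n  <b>1,234,567</b>  yen </div>"
node = ET.fromstring(FRAGMENT)

# string(.): concatenate every text node under the element.
string_value = "".join(node.itertext())

# normalize-space(.): strip both ends and collapse whitespace runs to one space.
normalized = " ".join(string_value.split())

# substring-after(., 'Total: '): the text following the first occurrence.
after_label = normalized.partition("Total: ")[2]

# translate(., ',', ''): drop every comma.
digits = after_label.replace(",", "")

print(repr(string_value), normalized, after_label, digits, sep="\n")
```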
Beware that when parsing an XML document (i.e. type
is xml
) using xpath
expressions, all namespaces are stripped from the document unless the top-level option use_namespaces
is set to true
.
请注意,当使用 xpath 表达式解析 XML 文档(即 type is XML)时,除非顶级选项 use _ nampace 设置为 true,否则将从文档中剥离所有名称空间。
For extraction with array
set to true, all matches will be extracted into an array. This is useful when extracting list elements or multiple parts of a website that can only be matched with the same selector.
对于将数组设置为 true 的提取,所有匹配都将提取到一个数组中。这在提取列表元素或网站的多个部分时非常有用,因为这些元素只能与同一个选择器匹配。
Scraping JSON
擦除 JSON
When parsing JSON, these sub-hashes specify JSONPaths to the values that you care about.
在解析 JSON 时,这些子哈希将 JSONPath 指定给您关心的值。
Sample incoming event:
传入事件样本:
{ "results": {
"data": [
{
"title": "Lorem ipsum 1",
"description": "Aliquam pharetra leo ipsum."
"price": 8.95
},
{
"title": "Lorem ipsum 2",
"description": "Suspendisse a pulvinar lacus."
"price": 12.99
},
{
"title": "Lorem ipsum 3",
"description": "Praesent ac arcu tellus."
"price": 8.99
}
]
}
}
Sample rule:
示例规则:
"extract": {
"title": { "path": "results.data[*].title" },
"description": { "path": "results.data[*].description" }
}
In this example the *
wildcard character makes the parser to iterate through all items of the data
array. Three events will be created as a result.
在这个例子中,* 通配符让解析器遍历数据数组的所有项目。结果将创建三个事件。
Sample outgoing events:
外发事件样本:
[
{
"title": "Lorem ipsum 1",
"description": "Aliquam pharetra leo ipsum."
},
{
"title": "Lorem ipsum 2",
"description": "Suspendisse a pulvinar lacus."
},
{
"title": "Lorem ipsum 3",
"description": "Praesent ac arcu tellus."
}
]
The extract
option can be skipped for the JSON type, causing the full JSON response to be returned.
可以跳过 JSON 类型的提取选项,从而返回完整的 JSON 响应。
Scraping Text
刮取文字
When parsing text, each sub-hash should contain a regexp
and index
. Output text is matched against the regular expression repeatedly from the beginning through to the end, collecting a captured group specified by index
in each match. Each index should be either an integer or a string name which corresponds to (?<*name*>...)
. For example, to parse lines of *word*: *definition*
, the following should work:
解析文本时,每个子散列应包含一个 regexp 和索引。输出文本从头到尾反复地与正则表达式进行匹配,收集在每次匹配中由索引指定的捕获的组。每个索引要么是一个整数,要么是一个对应于(?< 名称 > ...)。例如,要解析 word: Definition 的行,应该使用以下代码:
"extract": {
"word": { "regexp": "^(.+?): (.+)$", "index": 1 },
"definition": { "regexp": "^(.+?): (.+)$", "index": 2 }
}
Or if you prefer names to numbers for index:
或者如果你喜欢名字而不喜欢数字作为索引:
"extract": {
"word": { "regexp": "^(?<word>.+?): (?<definition>.+)$", "index": "word" },
"definition": { "regexp": "^(?<word>.+?): (?<definition>.+)$", "index": "definition" }
}
To extract the whole content as one event:
将整个内容作为一个事件提取:
"extract": {
"content": { "regexp": "\A(?m:.)*\z", "index": 0 }
}
Beware that .
does not match the newline character (LF) unless the m
flag is in effect, and ^
/$
basically match every line beginning/end. See this document to learn the regular expression variant used in this service.
小心点。不匹配换行符(LF) ,除非 m 标志有效,并且 ^/$基本上匹配每一行开始/结束。请参阅此文档以了解此服务中使用的正则表达式变量。
General Options
一般方案
Can be configured to use HTTP basic auth by including the basic_auth
parameter with "username:password"
, or ["username", "password"]
.
可以通过包含带有“ username: password”或[“ username”,“ password”]的 basic _ auth 参数来配置为使用 HTTP 基本身份验证。
Set expected_update_period_in_days
to the maximum amount of time that you’d expect to pass between Events being created by this Agent. This is only used to set the “working” status.
将 件 _ update _ period _ in _ days 设置为您预期在此代理创建的 Events 之间传递的最大时间量。这只用于设置“工作”状态。
Set uniqueness_look_back
to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of 200 or 3x the number of detected received results.
设置惟一性 _ look _ back 以限制检查唯一性的事件数(通常是为了性能)。默认值为检测到的结果数量的200或3倍。
Set force_encoding
to an encoding name (such as UTF-8
and ISO-8859-1
) if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Below are the steps used by Huginn to detect the encoding of fetched content:
如果已知网站在 Content-Type 头中响应缺少、无效或错误的字符集,则将 force _ coding 设置为编码名称(如 UTF-8和 ISO-8859-1)。下面是 Huginn 用来检测获取内容的编码的步骤:
- If 如果
force_encoding
is given, that value is used. 则使用该值 - If the Content-Type header contains a charset parameter, that value is used. 如果 Content-Type 标头包含字符集参数,则使用该值
- When 什么时候
type
is 是html
or 或者xml
, Huginn checks for the presence of a BOM, XML declaration with attribute “encoding”, or an HTML meta tag with charset information, and uses that if found. ,Huginn 检查是否存在 BOM、带有属性“编码”的 XML 声明或带有字符集信息的 HTML 元标记,如果找到,则使用该元标记 - Huginn falls back to UTF-8 (not ISO-8859-1). Huginn 回到 UTF-8(不是 ISO-8859-1)
Set user_agent
to a custom User-Agent name if the website does not like the default value (Huginn - https://github.com/huginn/huginn
).
如果网站不喜欢默认值,则将 user _ agent 设置为自定义 User-Agent 名称(Huginn- https://github.com/Huginn/Huginn )。
The headers
field is optional. When present, it should be a hash of headers to send with the request.
头字段是可选的。当出现时,它应该是一个随请求一起发送的头的散列表。
Set disable_ssl_verification
to true
to disable ssl verification.
如果要禁用 ssl 验证,请将 able _ ssl _ validation 设置为 true。
Set unzip
to gzip
to inflate the resource using gzip.
将 unzip 设置为 gzip 以使用 gzip 充分利用资源。
Set http_success_codes
to an array of status codes (e.g., [404, 422]
) to treat HTTP response codes beyond 200 as successes.
将 HTTP _ Success _ code 设置为一组状态代码(例如[404,422]) ,将超过200个的 HTTP 响应代码视为成功。
If a template
option is given, its value must be a hash, whose key-value pairs are interpolated after extraction for each iteration and merged with the payload. In the template, keys of extracted data can be interpolated, and some additional variables are also available as explained in the next section. For example:
如果给定了模板选项,那么它的值必须是散列,其键值对在每次迭代提取之后进行插值,并与有效负载合并。在模板中,可以对提取数据的键进行插值,还可以使用一些其他变量,如下一节所述。例如:
"template": {
"url": "{{ url | to_uri: _response_.url }}",
"description": "{{ body_text }}",
"last_modified": "{{ _response_.headers.Last-Modified | date: '%FT%T' }}"
}
In the on_change
mode, change is detected based on the resulted event payload after applying this option. If you want to add some keys to each event but ignore any change in them, set mode
to all
and put a DeDuplicationAgent downstream.
在 on _ change 模式下,应用此选项后,将根据所产生的事件有效负载检测更改。如果希望为每个事件添加一些键,但忽略其中的任何更改,请将模式设置为 all 并将 DeDuplicationAgent 放在下游。
Liquid Templating
液体模板
In Liquid templating, the following variables are available:
在液体模板中,下列变量是可用的:
-
_url_
: The URL specified to fetch the content from. When parsingdata_from_event
, this is not set._ URL _: 指定从中获取内容的 URL。在解析 data _ from _ event 时,未设置此值。
-
_response_
: A response object with the following keys:_ response _: 具有以下键的响应对象:
-
status
: HTTP status as integer. (Almost always 200) When parsingdata_from_event
, this is set to the value of thestatus
key in the incoming Event, if it is a number or a string convertible to an integer.状态: HTTP 状态为整数。(几乎总是200)在解析 data _ from _ Event 时,如果它是一个可转换为整数的数字或字符串,则将其设置为传入 Event 中的状态键的值。
-
headers
: Response headers; for example,{{ _response_.headers.Content-Type }}
expands to the value of the Content-Type header. Keys are insensitive to cases and -/_. When parsingdata_from_event
, this is constructed from the value of theheaders
key in the incoming Event, if it is a hash.Header: 响应 Header; 例如,{{ _ Response 。标题。{ Content-Type }}扩展为 Content-Type 标头的值。键对大小写和-/ 不敏感。在解析 data _ from _ Event 时,如果它是一个散列,那么它是由传入 Event 中的 Header 键的值构造的。
-
url
: The final URL of the fetched page, following redirects. When parsingdata_from_event
, this is set to the value of theurl
key in the incoming Event. Using this in thetemplate
option, you can resolve relative URLs extracted from a document like{{ link | to_uri: _response_.url }}
and{{ content | rebase_hrefs: _response_.url }}
.URL: 获取的页面的最终 URL,重定向之后。在解析 data _ from _ Event 时,这被设置为传入 Event 中的 url 键的值。在模板选项中使用此选项,可以解析从文档(如{{ link | to _ uri: _ response _)提取的相对 URL。}和{{ content | rebase _ hrefs: _ response _。你好。
-
Ordering Events
订购活动
To specify the order of events created in each run, set events_order
to an array of sort keys, each of which looks like either expression
or [expression, type, descending]
, as described as follows:
要指定每次运行中创建的事件顺序,请将 events _ order 设置为一个排序键数组,每个键看起来都像表达式或[ expression,type, 下降] ,如下所述:
-
expression is a Liquid template to generate a string to be used as sort key.
表达式是一个用于生成用作排序键的字符串的液体模板。
-
type (optional) is one of
string
(default),number
andtime
, which specifies how to evaluate expression for comparison.Type (可选)是字符串(缺省值)、数字和时间之一,它指定如何计算表达式以进行比较。
-
descending (optional) is a boolean value to determine if comparison should be done in descending (reverse) order, which defaults to
false
.降序(可选)是一个布尔值,用于确定比较是否应该按降序(反向)进行,默认为 false。
Sort keys listed earlier take precedence over ones listed later. For example, if you want to sort articles by the date and then by the author, specify [["{{date}}", "time"], "{{author}}"]
.
前面列出的排序键优先于后面列出的排序键。例如,如果希望按日期然后按作者对文章进行排序,请指定[[“{{ date }”,“ time”] ,“{{ author }}”]。
Sorting is done stably, so even if all events have the same set of sort key values the original order is retained. Also, a special Liquid variable _index_
is provided, which contains the zero-based index number of each event, which means you can exactly reverse the order of events by specifying [["{{_index_}}", "number", true]]
.
排序是稳定地进行的,因此即使所有事件具有相同的排序键值集合,也会保留原始顺序。此外,还提供了一个特殊的肃变量 _ index _,它包含每个事件的从零开始的索引号,这意味着您可以通过指定[[“{ _ index _ }”,“ number”,true ]]来精确地逆转事件的顺序。
If the include_sort_info
option is set, each created event will have a sort_info
key whose value is a hash containing the following keys:
如果设置了 include _ sort _ info 选项,那么每个创建的事件都将有一个 sort _ info 键,其值是一个包含以下键的散列值:
position
: 1-based index of each event after the sort : 排序后每个事件的从1开始的索引count
: Total number of events sorted : 已排序的事件总数
The Website Agent scrapes a website, XML document, or JSON feed and creates Events based on the results.
网站代理抓取网站、 XML 文档或 JSON 提要,并根据结果创建 Events。
Specify a url
and select a mode
for when to create Events based on the scraped data, either all
, on_change
, or merge
(if fetching based on an Event, see below).
指定一个 url 并选择一种模式,以便何时基于刮取的数据创建 Events,可以是 all、 on _ change 或 merge (如果基于 Event 进行抓取,请参见下文)。
The url
option can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape).
Url 选项可以是单个 url,也可以是 url 数组(例如,对于具有完全相同的结构但需要刮取的内容不同的多个页面)。
The WebsiteAgent can also scrape based on incoming events.
WebsiteAgent 还可以根据传入的事件进行刮取。
- Set the 设置
url_from_event
option to a 选择Liquid 液体 template to generate the url to access based on the Event. (To fetch the url in the Event’s 模板生成要基于事件访问的 URLurl
key, for example, set 例如,设置url_from_event
to 到{{ url }}
.) - Alternatively, set 或者,设置
data_from_event
to a 到一个Liquid 液体 template to use data directly without fetching any URL. (For example, set it to 模板直接使用数据而无需获取任何 URL。(例如,将其设置为{{ html }}
to use HTML contained in the 中包含的 HTMLhtml
key of the incoming Event.) 传入事件的键。) - If you specify 如果您指定
merge
for the 为了mode
option, Huginn will retain the old payload and update it with new values. 选项,Huginn 将保留旧的有效负载并用新值更新它
Supported Document Types
支持的文档类型
The type
value can be xml
, html
, json
, or text
.
类型值可以是 xml、 html、 json 或 text。
To tell the Agent how to parse the content, specify extract
as a hash with keys naming the extractions and values of hashes.
若要告诉代理如何解析内容,请将提取指定为散列,并使用键命名散列的提取和值。
Note that for all of the formats, whatever you extract MUST have the same number of matches for each extractor except when it has repeat
set to true. E.g., if you’re extracting rows, all extractors must match all rows. For generating CSS selectors, something like SelectorGadget may be helpful.
请注意,对于所有的格式,无论您提取什么,每个提取器都必须具有相同数量的匹配,除非它将 repeat 设置为 true。例如,如果要提取行,则所有提取器必须匹配所有行。对于生成 CSS 选择器,像 SelectorGadget 这样的东西可能会很有帮助。
For extractors with hidden
set to true, they will be excluded from the payloads of events created by the Agent, but can be used and interpolated in the template
option explained below.
对于隐藏设置为 true 的提取器,它们将被排除在 Agent 创建的事件的有效负载之外,但是可以在下面解释的模板选项中使用和插值。
For extractors with repeat
set to true, their first matches will be included in all extracts. This is useful such as when you want to include the title of a page in all events created from the page.
对于将 repeat 设置为 true 的提取器,它们的第一个匹配将包括在所有提取中。这非常有用,例如当您希望在从页面创建的所有事件中包含页面的标题时。
Scraping HTML and XML
抓取 HTML 和 XML
When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in css
or an XPath expression in xpath
. It then evaluates an XPath expression in value
(default: .
) on each node in the node set, converting the result into a string. Here’s an example:
在解析 HTML 或 XML 时,这些子哈希指定每个提取应该如何进行。Agent 首先通过计算 CSS 中的 CSS 选择器或 XPath 中的 XPath 表达式,从文档中为每个提取键选择一个节点集。然后计算 XPath 表达式的值(缺省值: .)在节点集中的每个节点上,将结果转换为字符串。这里有一个例子:
"extract": {
"url": { "css": "#comic img", "value": "@src" },
"title": { "css": "#comic img", "value": "@title" },
"body_text": { "css": "div.main", "value": "string(.)" },
"page_title": { "css": "title", "value": "string(.)", "repeat": true }
} or
"extract": {
"url": { "xpath": "//*[@class="blog-item"]/a/@href", "value": "."
"title": { "xpath": "//*[@class="blog-item"]/a", "value": "normalize-space(.)" },
"description": { "xpath": "//*[@class="blog-item"]/div[0]", "value": "string(.)" }
}
“@attr” is the XPath expression to extract the value of an attribute named attr from a node (such as “@href” from a hyperlink), and string(.)
gives a string with all the enclosed text nodes concatenated without entity escaping (such as &
). To extract the innerHTML, use ./node()
; and to extract the outer HTML, use .
.
“@attr”是一个 XPath 表达式,用于从节点(如超链接中的“@href”)和 string (.)中提取名为 attr 的属性的值给出一个字符串,其中所有封闭的文本节点连接在一起,没有实体转义(例如 &)。要提取 innerHTML,请使用。/node () ; 要提取外部 HTML,请使用。.
You can also use XPath functions like normalize-space
to strip and squeeze whitespace, substring-after
to extract part of a text, and translate
to remove commas from formatted numbers, etc. Instead of passing string(.)
to these functions, you can just pass .
like normalize-space(.)
and translate(., ',', '')
.
您还可以使用 XPath 函数,比如 norize-space 来去除和压缩空格,substring-after 来提取文本的一部分,翻译来删除格式化数字中的逗号,等等。而不是传递 string (.)传递给这些函数。比如标准化空间(.)翻译。,',','').
Beware that when parsing an XML document (i.e. type
is xml
) using xpath
expressions, all namespaces are stripped from the document unless the top-level option use_namespaces
is set to true
.
请注意,当使用 xpath 表达式解析 XML 文档(即 type is XML)时,除非顶级选项 use _ nampace 设置为 true,否则将从文档中剥离所有名称空间。
For extraction with array
set to true, all matches will be extracted into an array. This is useful when extracting list elements or multiple parts of a website that can only be matched with the same selector.
对于将数组设置为 true 的提取,所有匹配都将提取到一个数组中。这在提取列表元素或网站的多个部分时非常有用,因为这些元素只能与同一个选择器匹配。
Scraping JSON
擦除 JSON
When parsing JSON, these sub-hashes specify JSONPaths to the values that you care about.
在解析 JSON 时,这些子哈希将 JSONPath 指定给您关心的值。
Sample incoming event:
传入事件样本:
{ "results": {
"data": [
{
"title": "Lorem ipsum 1",
"description": "Aliquam pharetra leo ipsum."
"price": 8.95
},
{
"title": "Lorem ipsum 2",
"description": "Suspendisse a pulvinar lacus."
"price": 12.99
},
{
"title": "Lorem ipsum 3",
"description": "Praesent ac arcu tellus."
"price": 8.99
}
]
}
}
Sample rule:
示例规则:
"extract": {
"title": { "path": "results.data[*].title" },
"description": { "path": "results.data[*].description" }
}
In this example the *
wildcard character makes the parser to iterate through all items of the data
array. Three events will be created as a result.
在这个例子中,* 通配符让解析器遍历数据数组的所有项目。结果将创建三个事件。
Sample outgoing events:
外发事件样本:
[
{
"title": "Lorem ipsum 1",
"description": "Aliquam pharetra leo ipsum."
},
{
"title": "Lorem ipsum 2",
"description": "Suspendisse a pulvinar lacus."
},
{
"title": "Lorem ipsum 3",
"description": "Praesent ac arcu tellus."
}
]
The extract
option can be skipped for the JSON type, causing the full JSON response to be returned.
可以跳过 JSON 类型的提取选项,从而返回完整的 JSON 响应。
Scraping Text
刮取文字
When parsing text, each sub-hash should contain a regexp
and index
. Output text is matched against the regular expression repeatedly from the beginning through to the end, collecting a captured group specified by index
in each match. Each index should be either an integer or a string name which corresponds to (?<*name*>...)
. For example, to parse lines of *word*: *definition*
, the following should work:
解析文本时,每个子散列应包含一个 regexp 和索引。输出文本从头到尾反复地与正则表达式进行匹配,收集在每次匹配中由索引指定的捕获的组。每个索引要么是一个整数,要么是一个对应于(?< 名称 > ...)。例如,要解析 word: Definition 的行,应该使用以下代码:
"extract": {
"word": { "regexp": "^(.+?): (.+)$", "index": 1 },
"definition": { "regexp": "^(.+?): (.+)$", "index": 2 }
}
Or if you prefer names to numbers for index:
或者如果你喜欢名字而不喜欢数字作为索引:
"extract": {
"word": { "regexp": "^(?<word>.+?): (?<definition>.+)$", "index": "word" },
"definition": { "regexp": "^(?<word>.+?): (?<definition>.+)$", "index": "definition" }
}
To extract the whole content as one event:
将整个内容作为一个事件提取:
"extract": {
"content": { "regexp": "\A(?m:.)*\z", "index": 0 }
}
Beware that .
does not match the newline character (LF) unless the m
flag is in effect, and ^
/$
basically match every line beginning/end. See this document to learn the regular expression variant used in this service.
小心点。不匹配换行符(LF) ,除非 m 标志有效,并且 ^/$基本上匹配每一行开始/结束。请参阅此文档以了解此服务中使用的正则表达式变量。
General Options
一般方案
Can be configured to use HTTP basic auth by including the basic_auth
parameter with "username:password"
, or ["username", "password"]
.
可以通过包含带有“ username: password”或[“ username”,“ password”]的 basic _ auth 参数来配置为使用 HTTP 基本身份验证。
Set expected_update_period_in_days
to the maximum amount of time that you’d expect to pass between Events being created by this Agent. This is only used to set the “working” status.
将 件 _ update _ period _ in _ days 设置为您预期在此代理创建的 Events 之间传递的最大时间量。这只用于设置“工作”状态。
Set uniqueness_look_back
to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of 200 or 3x the number of detected received results.
设置惟一性 _ look _ back 以限制检查唯一性的事件数(通常是为了性能)。默认值为检测到的结果数量的200或3倍。
Set force_encoding
to an encoding name (such as UTF-8
and ISO-8859-1
) if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Below are the steps used by Huginn to detect the encoding of fetched content:
如果已知网站在 Content-Type 头中响应缺少、无效或错误的字符集,则将 force _ coding 设置为编码名称(如 UTF-8和 ISO-8859-1)。下面是 Huginn 用来检测获取内容的编码的步骤:
- If 如果
force_encoding
is given, that value is used. 则使用该值 - If the Content-Type header contains a charset parameter, that value is used. 如果 Content-Type 标头包含字符集参数,则使用该值
- When 什么时候
type
is 是html
or 或者xml
, Huginn checks for the presence of a BOM, XML declaration with attribute “encoding”, or an HTML meta tag with charset information, and uses that if found. ,Huginn 检查是否存在 BOM、带有属性“编码”的 XML 声明或带有字符集信息的 HTML 元标记,如果找到,则使用该元标记 - Huginn falls back to UTF-8 (not ISO-8859-1). Huginn 回到 UTF-8(不是 ISO-8859-1)
Set user_agent
to a custom User-Agent name if the website does not like the default value (Huginn - https://github.com/huginn/huginn
).
如果网站不喜欢默认值,则将 user _ agent 设置为自定义 User-Agent 名称(Huginn- https://github.com/Huginn/Huginn )。
The headers field is optional. When present, it should be a hash of headers to send with the request.
Set disable_ssl_verification to true to disable SSL verification.
Set unzip to gzip to inflate the resource using gzip.
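"Inflate" here means decompressing a gzip-compressed response body before it is parsed. A minimal Python sketch of that step, where the function name inflate is hypothetical and only mirrors the option's effect:

```python
import gzip


def inflate(body, unzip=None):
    """Decompress the fetched bytes when unzip is "gzip"; otherwise return them unchanged."""
    if unzip == "gzip":
        return gzip.decompress(body)
    return body
```

This is mainly useful for resources served as .gz files, where the server does not decompress the content for you.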
Set http_success_codes to an array of status codes (e.g., [404, 422]) to treat HTTP response codes beyond 200 as successes.
If a template option is given, its value must be a hash, whose key-value pairs are interpolated after extraction for each iteration and merged with the payload. In the template, keys of extracted data can be interpolated, and some additional variables are also available as explained in the next section. For example:
"template": {
  "url": "{{ url | to_uri: _response_.url }}",
  "description": "{{ body_text }}",
  "last_modified": "{{ _response_.headers.Last-Modified | date: '%FT%T' }}"
}
In the on_change mode, change is detected based on the resulting event payload after applying this option. If you want to add some keys to each event but ignore any change in them, set mode to all and put a DeDuplicationAgent downstream.
Liquid Templating
In Liquid templating, the following variables are available:
- _url_: The URL specified to fetch the content from. When parsing data_from_event, this is not set.
- _response_: A response object with the following keys:
  - status: HTTP status as integer. (Almost always 200) When parsing data_from_event, this is set to the value of the status key in the incoming Event, if it is a number or a string convertible to an integer.
  - headers: Response headers; for example, {{ _response_.headers.Content-Type }} expands to the value of the Content-Type header. Keys are insensitive to case and to the difference between - and _. When parsing data_from_event, this is constructed from the value of the headers key in the incoming Event, if it is a hash.
  - url: The final URL of the fetched page, following redirects. When parsing data_from_event, this is set to the value of the url key in the incoming Event. Using this in the template option, you can resolve relative URLs extracted from a document like {{ link | to_uri: _response_.url }} and {{ content | rebase_hrefs: _response_.url }}.
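Two behaviours of the _response_ object described above can be approximated in Python: header lookup that ignores case and the difference between - and _, and resolving a relative link against the final page URL. The Headers class and the to_uri function below are hypothetical stand-ins written for illustration; in Huginn these are a Liquid variable and a Liquid filter, not Python.

```python
from urllib.parse import urljoin


def normalize_header_key(name):
    """Fold case and treat "_" like "-" so that lookups are insensitive to both."""
    return name.lower().replace("_", "-")


class Headers:
    """Illustrative header map with case- and -/_-insensitive keys."""

    def __init__(self, raw):
        self._data = {normalize_header_key(key): value for key, value in raw.items()}

    def get(self, name, default=None):
        return self._data.get(normalize_header_key(name), default)


def to_uri(link, base_url):
    """Resolve a possibly relative link against the page URL."""
    return urljoin(base_url, link)
```

For example, a link "../a.html" found on https://example.com/blog/2022/post.html resolves to https://example.com/blog/a.html, while an absolute link is returned unchanged.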
Ordering Events
To specify the order of events created in each run, set events_order to an array of sort keys, each of which looks like either expression or [expression, type, descending], as described below:
- expression is a Liquid template to generate a string to be used as sort key.
- type (optional) is one of string (default), number and time, which specifies how to evaluate expression for comparison.
- descending (optional) is a boolean value to determine if comparison should be done in descending (reverse) order, which defaults to false.
Sort keys listed earlier take precedence over ones listed later. For example, if you want to sort articles by the date and then by the author, specify [["{{date}}", "time"], "{{author}}"].
Sorting is done stably, so even if all events have the same set of sort key values the original order is retained. Also, a special Liquid variable _index_ is provided, which contains the zero-based index number of each event, which means you can exactly reverse the order of events by specifying [["{{_index_}}", "number", true]].
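The sort-key semantics can be sketched in Python as a series of stable sorts applied from the last key to the first, so that earlier keys take precedence. This is an illustration, not Huginn's code: sort_events is a hypothetical name, plain field names stand in for Liquid expressions such as {{date}}, time values are assumed to be ISO 8601 strings, and how ties behave under a descending key is an assumption.

```python
from datetime import datetime


def _convert(value, kind):
    """Evaluate a sort value according to its declared type."""
    if kind == "number":
        return float(value)
    if kind == "time":
        return datetime.fromisoformat(value)
    return str(value)


def sort_events(events, events_order):
    """Sort events by [field, type, descending] keys; earlier keys take precedence."""
    # Attach the zero-based _index_ so it can be used as a sort key.
    indexed = [dict(event, _index_=i) for i, event in enumerate(events)]
    # Apply the least significant key first; each pass is a stable sort.
    for key in reversed(events_order):
        if isinstance(key, str):
            field, kind, descending = key, "string", False
        else:
            field = key[0]
            kind = key[1] if len(key) > 1 else "string"
            descending = key[2] if len(key) > 2 else False
        indexed.sort(key=lambda e: _convert(e[field], kind), reverse=descending)
    # Drop the helper index before returning.
    return [{k: v for k, v in e.items() if k != "_index_"} for e in indexed]
```

Under these assumptions, sort_events(articles, [["date", "time"], "author"]) orders by date and then by author, and sort_events(articles, [["_index_", "number", True]]) reverses the original order exactly.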
If the include_sort_info option is set, each created event will have a sort_info key whose value is a hash containing the following keys:
- position: 1-based index of each event after the sort
- count: Total number of events sorted
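A small Python sketch of what the sort_info hash contains, given a list of events that has already been sorted. The function name add_sort_info is hypothetical; only the position and count keys come from the documentation above.

```python
def add_sort_info(sorted_events):
    """Attach a sort_info hash (1-based position, total count) to each sorted event."""
    count = len(sorted_events)
    return [
        dict(event, sort_info={"position": position, "count": count})
        for position, event in enumerate(sorted_events, start=1)
    ]
```

A downstream agent can use position and count to tell, for instance, whether an event is the last one of its run.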
Website Metadata Agent
The WebsiteMetadata Agent extracts metadata from HTML. It supports schema.org microdata, embedded JSON-LD and the common meta tag attributes.
- data: HTML to use in the extraction process; use Liquid formatting to select data from incoming events.
- url: Optionally set the source URL of the provided HTML (without a URL, schema.org links cannot be extracted properly).
- result_key: Sets the key which contains the extracted information.
- merge: Set to true to retain the received payload and update it with the extracted result.
Liquid formatting can be used in all options.
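To give a feel for how result_key and merge shape the outgoing event, here is a minimal Python sketch that extracts only embedded JSON-LD using the standard library. It is not the agent's implementation: the names JsonLdParser and extract_metadata, the default result_key of "data", and the choice to store the extracted items as a list under result_key are all assumptions, and microdata and meta tag extraction are left out.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


def extract_metadata(payload, html, result_key="data", merge=False):
    """Return the extracted JSON-LD under result_key, optionally merged into the payload."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    result = {result_key: parser.items}
    # merge=True keeps the received payload and updates it with the result.
    return {**payload, **result} if merge else result
```

With merge left at false only the extracted result is emitted; with merge set to true the incoming payload's other keys are carried along.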