Monitoring rules reference¶
Monitoring rules are configured on a per-resource basis, with rules for the following resources:
- Agents
- Schedules
- Objects and links
- Streaming datasets
- Live deployments
- Time series syncs
- Geotemporal observations
- Automations
- Datasets
- Functions
- Actions
All monitoring rules contain a configurable field called Alert severity, which is the severity assigned to an alert when the rule's condition is triggered. Monitoring views can be configured to send only alerts that meet or exceed a certain severity.
| Rule component | Description | Example options |
|---|---|---|
| Alert severity | Severity of monitoring report condition | Low, Medium, High |
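The severity filter described above can be pictured with a small sketch. This is an illustration only, not the platform's implementation: the `Severity` enum, the `should_send` helper, and the sample alert names are all hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Ordered so that a higher value means a more severe alert.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def should_send(alert_severity: Severity, minimum: Severity) -> bool:
    """Return True when the alert meets or exceeds the view's minimum severity."""
    return alert_severity >= minimum


# Hypothetical alerts paired with the severity configured on their rules.
alerts = [("low disk space", Severity.LOW), ("heartbeat stale", Severity.HIGH)]
sent = [name for name, severity in alerts if should_send(severity, Severity.MEDIUM)]
print(sent)  # ['heartbeat stale']
```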
Agent rules¶
Agent last heartbeat time¶
Alerts when the agent bootstrapper's last heartbeat is older than a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Amount of time elapsed since the last heartbeat received from the agent bootstrapper | 10 minutes |
We recommend setting this monitor value to 10 minutes.
Agent manager version stale time¶
Alerts when the agent bootstrapper has remained on an old version for longer than a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Amount of time the agent manager has been on an old version | 10 days |
We recommend setting this monitor value to 10 days.
Agent version stale time¶
Alerts when the agent has remained on an old version for longer than a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Amount of time the agent has been on an older version | 10 days |
We recommend setting this monitor value to 10 days.
High CPU utilization¶
Alerts when the agent CPU utilization exceeds a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Percentage of CPU utilization | 80 |
We recommend setting this monitor value to 80 (%).
JVM heap usage is close to the limit¶
Alerts when the JVM heap usage exceeds a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Percentage of JVM heap used / JVM heap available | 70 |
We recommend setting this monitor value to 70 (%).
Low disk space¶
Alerts when the available disk space drops below a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is less than | Available disk space | 10GB |
We recommend setting this monitor value to 10GB.
Time until earliest keystore certificate expires¶
Alerts when a certificate in the agent's keystore will expire within a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is less than | Amount of time until a certificate expires | 10 days |
We recommend setting this monitor value to medium severity at less than 30 days and high severity at less than 10 days.
Time until earliest truststore certificate expires¶
Alerts when a certificate in the agent's truststore will expire within a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is less than | Amount of time until a certificate expires | 10 days |
We recommend setting this monitor value to medium severity at less than 30 days and high severity at less than 10 days.
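The two-tier recommendation for keystore and truststore certificates can be sketched as follows. The function name and return values are hypothetical, and the sketch assumes the expiry time of the earliest-expiring certificate is already known.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def certificate_severity(expires_at: datetime, now: datetime) -> Optional[str]:
    """Map the time remaining until expiry onto the recommended severities."""
    remaining = expires_at - now
    if remaining < timedelta(days=10):
        return "high"
    if remaining < timedelta(days=30):
        return "medium"
    return None  # No alert: the certificate is not close to expiring.


# Example: a certificate that expires in 12 days falls in the medium tier.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(certificate_severity(now + timedelta(days=12), now))  # medium
```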
Queue size¶
Alerts when the number of jobs queued on an agent exceeds a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | The number of jobs in the agent's job queue | 70 |
We recommend setting this monitor value to 70 (jobs).
Schedule rules¶
Consecutive schedule failures¶
Alerts when the number of consecutive schedule failures meets or exceeds a set threshold. This does not count schedule runs that result in a cancelled build.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of consecutive schedule failures | 1 |
The default behavior for this monitor is to alert with medium severity at one failure and high severity at three failures, though these thresholds are highly dependent on the frequency and stability of the schedules that are included in the monitoring rule's scope.
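As a rough illustration of how a consecutive-failure streak might be counted, the sketch below walks a schedule's run history from newest to oldest. The outcome strings and function names are hypothetical. Treating a cancelled run as neither extending nor resetting the streak is an assumption; the text above only states that cancelled builds are not counted.

```python
from typing import Sequence


def consecutive_failures(outcomes: Sequence[str]) -> int:
    """Count the trailing run of failures; `outcomes` is ordered oldest to newest."""
    streak = 0
    for outcome in reversed(outcomes):
        if outcome == "cancelled":
            continue  # Assumption: cancelled runs are skipped entirely.
        if outcome != "failed":
            break  # A successful run ends the streak.
        streak += 1
    return streak


def schedule_alert(outcomes: Sequence[str], threshold: int = 1) -> bool:
    """Apply the rule's "greater than or equal to" comparison."""
    return consecutive_failures(outcomes) >= threshold


history = ["succeeded", "failed", "cancelled", "failed"]
print(consecutive_failures(history))  # 2
```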
Schedule duration¶
Alerts when a schedule is running longer than a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | The duration of the schedule | 2 hours |
This monitor is typically used on highly critical schedules to quickly show whether the schedule will complete in the expected time. Because schedule durations vary, this monitor is often scoped to individual schedules.
Object and link rules¶
:::callout{theme="neutral"} A user-caused failure is a job failure that results from a problem with the configuration or input data, such as an invalid schema, a malformed row, or insufficient permissions. These failures are distinguished from transient or infrastructure-related failures, which are not surfaced by these alerts because they are typically resolved automatically by retries. :::
Changelog jobs failing¶
Alerts when the "changelog" job for the object or link is failing on either the active pipeline or the replacement pipeline. Only alerts on user-caused failures.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of consecutive user-caused changelog job failures | 1 |
The default behavior for this monitor is to alert with medium severity at one failure and high severity at three failures.
Merge changes job failing¶
Alerts when the "merge changes" job for the object or link is failing on either the active pipeline or the replacement pipeline. Only alerts on user-caused failures.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of consecutive user-caused merge job failures | 1 |
The default behavior for this monitor is to alert with medium severity at one failure and high severity at three failures.
Sync jobs failing¶
Alerts when the "sync" job for the object or link is failing on either the active pipeline or the replacement pipeline.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of consecutive sync job failures | 3 |
The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures.
Scroll job failing on pipeline¶
Alerts when the "scroll" job for the object or link's active or replacement pipeline is failing. Scroll jobs are responsible for streaming data from the backing datasource to the object databases.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of consecutive scroll job failures | 3 |
The default behavior for this monitor is to alert with low severity at one failure, medium severity at three failures, and high severity at seven failures. These values are configurable.
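A tiered default such as low at one failure, medium at three, and high at seven can be modeled as a small lookup table. The sketch below is hypothetical: `DEFAULT_TIERS` and `severity_for` are illustrative names, not part of the product.

```python
from typing import Optional, Sequence, Tuple

# (minimum consecutive failures, severity), mirroring the defaults described above.
DEFAULT_TIERS: Tuple[Tuple[int, str], ...] = ((1, "low"), (3, "medium"), (7, "high"))


def severity_for(failures: int, tiers: Sequence[Tuple[int, str]] = DEFAULT_TIERS) -> Optional[str]:
    """Return the severity of the highest tier whose minimum has been reached."""
    matched = None
    for minimum, severity in sorted(tiers):
        if failures >= minimum:
            matched = severity
    return matched


print(severity_for(4))  # medium
```

Passing a different `tiers` sequence models a rule whose thresholds have been reconfigured.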
Sync propagation delay¶
Alerts when a dataset backing the object has a transaction with a sync time that exceeds a set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of time taken to sync a transaction | 1 day |
Invalid stream records detected¶
Alerts when records in an input stream contain format violations. The scroll job ignores these records. This rule is non-configurable, alerting with critical severity when the number of ignored rows is greater than or equal to one.
Streaming dataset rules¶
Derived stream monitors¶
Last checkpoint duration¶
Alerts if the last checkpoint took more time than the configured threshold to complete.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of time taken to checkpoint | 10 minutes |
Liveness: time since last successful checkpoint¶
Alerts if the stream has not completed a checkpoint within the configured threshold. The default threshold is 2 minutes. This monitor covers streams that are not running as well as streams that are failing to checkpoint.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of time elapsed since last checkpoint | 2 minutes |
Total lag¶
Alerts if a stream's lag (total unprocessed upstream records) exceeds the set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of unprocessed upstream records | 1000 |
This monitor indicates that streaming transforms are taking too long to run, or there is a problem with the streaming transforms infrastructure.
Total throughput¶
Alerts if a stream's throughput (records processed per checkpoint) falls below the set threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is less than | Threshold of records processed per checkpoint | 100 |
This monitor indicates that streaming transforms are taking too long to run, or there is a problem with the streaming transforms infrastructure.
Ingest stream monitors¶
Records ingested over last 5 minutes / 30 minutes / 1 hour / 4 hours / 1 day¶
Alerts if the number of records ingested into the raw stream's live view over the selected time window was less than or equal to the configured threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is less than or equal to | Threshold of ingested records per unit time | 100 |
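The windowed comparison can be sketched as below. This is a simplified model under stated assumptions: record arrival times are available as a list of timestamps, and `records_in_window` and `ingest_alert` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def records_in_window(timestamps: Iterable[datetime], now: datetime, window: timedelta) -> int:
    """Count records that arrived within the trailing window ending at `now`."""
    start = now - window
    return sum(1 for ts in timestamps if start < ts <= now)


def ingest_alert(timestamps: Iterable[datetime], now: datetime,
                 window: timedelta, threshold: int) -> bool:
    """Alert when the count is less than or equal to the configured threshold."""
    return records_in_window(timestamps, now, window) <= threshold


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
arrivals = [now - timedelta(minutes=1), now - timedelta(minutes=10), now - timedelta(hours=2)]
print(records_in_window(arrivals, now, timedelta(minutes=30)))  # 2
```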
Live deployment rules¶
Live deployment heartbeat¶
Alerts when a deployment has not emitted a heartbeat for more than one minute.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of time elapsed since last heartbeat | 1 minute |
Time series sync rules¶
Points written by the time series sync over last 5 or 30 minutes¶
Alerts if the number of points written by the time series sync over the last 5-minute or 30-minute window was less than or equal to the configured threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is less than or equal to | Threshold of points written per unit time | 100 |
Dataset rules¶
Time since job last succeeded¶
Alerts when a job on a dataset has not succeeded within a specified time threshold. Unlike the "Time since last updated" health check, the following conditions will always count as a passing status for the monitor:
- The job succeeded, but the transaction was aborted
- The job succeeded, but no new data was added
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Amount of time elapsed since a job on the dataset last succeeded | 1 day |
We recommend setting this monitor value based on your dataset's expected update frequency. For daily updates, set it to 1 day.
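To make the difference from an update-based check concrete, the sketch below looks only at whether a job succeeded and ignores what happened to its transaction. The `JobRun` fields and helper names are hypothetical. Treating a dataset with no successful job on record as stale is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass
class JobRun:
    finished_at: datetime
    succeeded: bool
    transaction_aborted: bool = False  # Ignored by this monitor.
    added_new_data: bool = True        # Ignored by this monitor.


def last_success(runs: Iterable[JobRun]) -> Optional[datetime]:
    """Return the finish time of the most recent successful job, if any."""
    times = [run.finished_at for run in runs if run.succeeded]
    return max(times) if times else None


def job_stale(runs: Iterable[JobRun], now: datetime, threshold: timedelta) -> bool:
    latest = last_success(runs)
    if latest is None:
        return True  # Assumption: no successful job on record counts as stale.
    return now - latest > threshold


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
runs = [JobRun(now - timedelta(hours=3), succeeded=True, transaction_aborted=True)]
print(job_stale(runs, now, timedelta(days=1)))  # False
```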
Geotemporal observation rules¶
Geotemporal observations sent over last 5 or 30 minutes¶
Alerts if the number of geotemporal observations sent over the last 5-minute or 30-minute window was less than or equal to the configured threshold.
| Rule component | Description | Example options |
|---|---|---|
| If value is less than or equal to | Threshold of geotemporal observations sent per unit time | 100 |
Automation rules¶
The following rules apply to both automations and time series streaming automations.
Automation has no new evaluations¶
Alerts if there has been no new evaluation within the configured threshold. Use this rule to detect performance degradation in an automation that should have been evaluated but was not. This rule does not alert when the automation has not been triggered.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of time elapsed since last automation evaluation | 1 hour |
Automation has no new triggers¶
Alerts if there have been no new monitor triggers within the configured threshold. Use this rule to detect when an automation is not being triggered as expected.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than or equal to | Threshold of time elapsed since last automation trigger | 1 day |
The single latest event exceeded the automation's failure threshold¶
Alerts if the single most recent execution had more than the configured number of failures. This rule considers only the latest event, not multiple events.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of number of failures in most recent automation execution | 10 |
Automation has been disabled by the system¶
Alerts if an automation was disabled by the system due to reaching limits or triggering cycles. This rule is non-configurable, alerting with high severity when the automation is disabled.
Automation had repeated execution failures in a window¶
Alerts when the number of failed automation executions in the window exceeds the configured threshold. Use this rule to surface automations that keep running and failing rather than ones the system has already disabled.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of number of failed executions | 0 |
| Time window | The time period to count failed executions in | 1 hour |
Automation had repeated evaluation failures in a window¶
Alerts when the number of failed automation evaluations in the window exceeds the configured threshold. Use this rule to catch automations whose trigger conditions fail to evaluate, separately from failures that occur during execution.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of number of failed evaluations | 0 |
| Time window | The time period to count failed evaluations in | 1 hour |
Automation had a high number of effect execution failures in a window¶
Alerts when the number of failed effect executions in the window exceeds the configured threshold. Effects are the downstream actions or notifications the automation runs. Use this rule to catch automations whose effects fail even when their triggers and evaluations succeed.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of number of failed effect executions | 0 |
| Time window | The time period to count failed effect executions in | 1 hour |
Function rules¶
Function executions can fail for a variety of reasons. For a full list of failure types, see function failure types.
The Number of function failures in window rule tracks all failure types. The Number of user-facing function failures in window rule tracks only user-facing errors. The Number of non-user-facing function failures in window rule tracks all failure types except user-facing errors.
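The relationship between the three failure-count rules can be sketched as a partition of the failures seen in a window. The `Failure` record and `failure_counts` helper are hypothetical, and the sketch assumes each failure is already labeled as user-facing or not.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable


@dataclass
class Failure:
    at: datetime
    user_facing: bool


def failure_counts(failures: Iterable[Failure], now: datetime, window: timedelta) -> Dict[str, int]:
    """Count failures in the trailing window, split the way the three rules split them."""
    recent = [f for f in failures if now - window < f.at <= now]
    user_facing = sum(1 for f in recent if f.user_facing)
    return {
        "all": len(recent),
        "user_facing": user_facing,
        "non_user_facing": len(recent) - user_facing,
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
failures = [
    Failure(now - timedelta(minutes=5), user_facing=True),
    Failure(now - timedelta(minutes=20), user_facing=False),
    Failure(now - timedelta(hours=3), user_facing=False),  # Outside a 1 hour window.
]
counts = failure_counts(failures, now, timedelta(hours=1))
print(counts)  # {'all': 2, 'user_facing': 1, 'non_user_facing': 1}
```

With a threshold of 0, each rule would alert as soon as its own count is greater than zero.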
Function duration p95¶
Alerts when the p95 function duration exceeds the specified thresholds. The p95 is measured over a sliding window of recent data.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of duration | 10s |
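For readers unfamiliar with percentile thresholds, the sketch below computes a p95 with the nearest-rank method over durations that are assumed to have already been collected from the sliding window. The source does not state the window size or which percentile method is used, so both are assumptions, and the function names are hypothetical.

```python
from typing import Sequence


def p95(durations: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value with at least 95% of samples at or below it."""
    if not durations:
        raise ValueError("p95 requires at least one duration")
    ordered = sorted(durations)
    rank = (95 * len(ordered) + 99) // 100  # Integer ceiling of 0.95 * n, 1-based.
    return ordered[rank - 1]


def duration_alert(durations: Sequence[float], threshold_seconds: float = 10.0) -> bool:
    """Apply the rule's "greater than" comparison to the p95 duration."""
    return p95(durations) > threshold_seconds


# Two slow calls out of twenty are enough to push the p95 past a 10 second threshold.
samples = [1.0] * 18 + [30.0, 30.0]
print(p95(samples), duration_alert(samples))  # 30.0 True
```

A single outlier among twenty samples would not move this p95, which is why the rule is less noisy than one based on the maximum duration.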
Number of function failures in window¶
Alerts when the total number of failed executions of a function in the given window exceeds a given threshold. This rule tracks all failure types, including both user-facing and non-user-facing errors.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of number of failures | 0 |
| Time window | The time period to count failures in | 1 hour |
Number of user-facing function failures in window¶
Alerts when the number of user-facing function failures over a given window exceeds the specified thresholds. This rule tracks only user-facing errors thrown by function code.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of number of failures | 0 |
| Time window | The time period to count failures in | 1 hour |
Number of non-user-facing function failures in window¶
Alerts when the number of function failures over a given window exceeds the specified thresholds, excluding user-facing errors thrown by function code. This rule is useful for monitoring infrastructure and system-level failures without noise from expected user input errors.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of number of failures | 0 |
| Time window | The time period to count failures in | 1 hour |
Action rules¶
Action executions can fail for a variety of reasons. For a full list of failure types, see action failure types.
The Number of action failures in window rule tracks all failure types. The Number of non-user-facing action failures in window rule tracks all failure types except user-facing function failures.
Action duration p95¶
Alerts when the p95 action duration exceeds the specified thresholds. The p95 is measured over a sliding window of recent data.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of duration | 10s |
Number of action failures in window¶
Alerts when the total number of failed executions of an action in the given window exceeds a given threshold. This rule tracks all failure types, including both user-facing and non-user-facing errors.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of number of failures | 0 |
| Time window | The time period to count failures in | 1 hour |
Number of non-user-facing action failures in window¶
Alerts when the number of action failures over a given window exceeds the specified thresholds, excluding user-facing errors that are thrown by function-backed action code and displayed to users. This rule is useful for monitoring infrastructure and system-level failures without noise from expected user input errors.
| Rule component | Description | Example options |
|---|---|---|
| If value is greater than | Threshold of number of failures | 0 |
| Time window | The time period to count failures in | 1 hour |