
Checks reference

This page provides more detailed documentation on available health check types.

| Category | Check type | Supported resources |
| --- | --- | --- |
| Status | Schedule status | Datasets |
| Status | Build status | Datasets, Iceberg tables, Virtual tables |
| Status | Job status | Datasets, Iceberg tables, Virtual tables |
| Status | Sync status | Datasets |
| Time | Build duration | Datasets |
| Time | Data freshness | Datasets |
| Time | Sync duration | Datasets |
| Time | Sync freshness | Datasets |
| Time | Time since last updated | Datasets, Iceberg tables, Virtual tables |
| Time | Time since sync last updated | Datasets |
| Size | Dataset file count | Datasets |
| Size | Dataset partition | Datasets |
| Size | Row count | Datasets |
| Size | Transaction file count | Datasets |
| Size | Transaction file size | Datasets |
| Content | Allowed column values | Datasets |
| Content | Approximate unique percentage | Datasets |
| Content | Column regex | Datasets |
| Content | Approximate column relation | Datasets |
| Content | Date range | Datasets |
| Content | Null percentage | Datasets |
| Content | Numeric mean | Datasets |
| Content | Numeric median | Datasets |
| Content | Numeric range | Datasets |
| Content | Primary key | Datasets, Iceberg tables, Virtual tables |
| Schema | Column | Datasets, Iceberg tables, Virtual tables |
| Schema | Column count | Datasets |
| Schema | Schema | Datasets, Iceberg tables, Virtual tables |

Status checks

Schedule status

Checks whether the most recent build of the schedule succeeded or failed.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

A schedule status check represents the status of the pipeline or set of datasets that always build together. As a result, it reports a status across the various steps leading to the creation or update of the final dataset.

Build status

Checks whether the most recent build of the dataset succeeded or failed.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

A build status check represents the status of the whole process that builds a final dataset, so it reports a status across the various steps leading to the creation or update of that dataset. Note that if intermediate datasets updated or created during the process also have a build status health check, those checks will not be updated. However, job status checks will be updated for all of these intermediate datasets.

Job status

Checks whether the most recent job run on a dataset succeeded or failed.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

A job status check triggers independently of the build that causes the dataset to be refreshed or created. In other words, whether or not the dataset is the final output of a given build, the job status check will run every time the dataset is built.

When to use job status, build status, or schedule status checks

In general, it is recommended that all schedules have schedule status checks. If your schedule already has a schedule status check, installing job status checks on other datasets built by the same schedule is not recommended, because any job that fails on the schedule will already trigger the schedule status check.

Use a job status check on intermediate datasets if you want to check whether the dataset was updated, regardless of whether other datasets in the build were successfully updated. Use a build status check if the dataset is a build output and you want to check that the entire build, including this dataset and all others, succeeded.

Build status and job status will be equivalent if the dataset is the only output of a build. They may differ if the dataset is an intermediate dataset or if the build has multiple outputs, and the job on the dataset succeeds (or does not run), but other jobs in the build fail and cause the build to fail.
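
The difference between the two checks can be illustrated with a minimal sketch. The `Job` class, the dataset names, and both helper functions below are hypothetical and exist only for illustration; they are not part of any product API. The sketch assumes a build succeeds only when every job in it succeeds.

```python
from dataclasses import dataclass


@dataclass
class Job:
    dataset: str     # dataset produced by this job (hypothetical name)
    succeeded: bool  # False means the job failed


def job_status(jobs, dataset):
    """Status of the job that produced `dataset`, ignoring all other jobs."""
    for job in jobs:
        if job.dataset == dataset:
            return job.succeeded
    return None  # no job for this dataset ran in the build


def build_status(jobs):
    """Assumption: a build succeeds only if every job in it succeeded."""
    return all(job.succeeded for job in jobs)


# One build with an intermediate dataset and a final output.
build = [Job("raw_clean", True), Job("final_report", False)]

print(job_status(build, "raw_clean"))  # True: the intermediate job succeeded
print(build_status(build))             # False: another job in the build failed
```

Here the job status of the intermediate dataset passes while the build status fails, which is the case where the two checks diverge.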

Sync status

Checks whether the most recent sync of the dataset to another database succeeded or failed.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Sync destination | Which sync of the dataset to monitor, relevant especially when the dataset syncs to multiple destinations. | phonograph2-cache-worker, jdbc-worker | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Escalate | Whether to escalate severity after consecutive failures | Y, N | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Time checks

Build duration

Checks whether the total time a build takes to complete meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Build duration | Total time a build takes to complete (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Median deviation | Difference (in approximate standard deviations) from the median time to complete recent builds | 1 Standard deviations, 10 Recent builds | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

As with the build status check, the build duration check is only updated for the terminal output of the build. Intermediate datasets that are part of a larger build and have a build duration check attached will not be updated.

Data freshness

Checks the time of the latest transaction on a dataset against the maximum value of a timestamp column. If the timestamp in the column represents when the row was added, this can be used to measure exact data freshness.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the column containing the time of the last update. | LAST_UPDATED | Y |
| Freshness range | Time range during which to consider the column's latest data as "fresh" (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |
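
As a rough illustration of the comparison this check describes, the sketch below computes the gap between the latest transaction time and the maximum value of a timestamp column, then tests it against a "Less than or equal to" style threshold. The function names and sample values are hypothetical, and how the platform evaluates the range internally is not documented here.

```python
from datetime import datetime, timedelta


def data_freshness(latest_transaction, column_values):
    """Gap between the latest transaction and the newest timestamp in the column."""
    return latest_transaction - max(column_values)


def is_fresh(gap, max_gap):
    """Assumed 'Less than or equal to' style comparison on the freshness gap."""
    return gap <= max_gap


# Hypothetical values of a LAST_UPDATED column and a transaction time.
last_updated = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 2, 6, 0)]
transaction_time = datetime(2024, 1, 2, 12, 0)

gap = data_freshness(transaction_time, last_updated)
print(gap)                                # 6:00:00
print(is_fresh(gap, timedelta(hours=12)))  # True
```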

Sync duration

Checks whether the total time a sync takes to complete meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Sync destination | Which sync of the dataset to monitor, relevant especially when the dataset syncs to multiple destinations. | phonograph2-cache-worker, jdbc-worker | Y |
| Sync duration | Total time a sync takes to complete (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Median deviation | Difference (in approximate standard deviations) from the median time to complete recent syncs | 1 Standard deviations, 10 Recent builds | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Sync freshness

Checks the time of the latest sync of a dataset against the maximum value of a datetime column. If the timestamp in the column represents when the row was added, this can be used to measure exact data freshness.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the column containing the time of the last update. | LAST_UPDATED | Y |
| Freshness range | Time range during which to consider the column's latest data as "fresh" (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Time since last updated

Checks whether the total time since the dataset was last updated (had a new transaction) meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Last updated | Total time since the dataset was last updated (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Median deviation | Difference (in approximate standard deviations) from the median update time of recent builds | 1 Standard deviations, 10 Recent builds | N |
| Ignore empty transactions | Whether to exclude empty transactions when checking time since updated/median deviation. Transactions with no files will be ignored, as if they had not existed | Y, N | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Schedule | Schedule check to run automatically or manually | Automatic, Custom Schedule | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Time since sync last updated

Checks whether the total time since the dataset last synced to some destination meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Last sync | Total time since the dataset last synced to some destination (in days, minutes, or hours) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Median deviation | Difference (in approximate standard deviations) from the median update time of recent builds | 1 Standard deviations, 10 Recent builds | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Size checks

Dataset file count

Checks the total number of files in the latest view of the dataset.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| File count | Total number of files in the most recent view of a dataset | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median number of files in recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Dataset partition

Checks if the partitioning of the dataset is performant.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Notes | How the partitioning check works (described after this table) | No options to configure | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

The partitioning check works as follows:
- If there are fewer than 50 files in total, the check always passes.
- If there are 50 or more files in total, the check passes if at least 90% of the files are more than 96MB in size.

If the check fails, the partitioning of the data across files is sub-optimal for performance and the data needs to be partitioned better.
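
The partitioning rule described above can be sketched as a small function. This is an illustrative reading of the stated thresholds, not the platform's implementation; the function name is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
MIN_FILES = 50                             # below this, the check always passes
SIZE_THRESHOLD_BYTES = 96 * 1024 * 1024    # assumes 1 MB = 1024 * 1024 bytes


def partition_check_passes(file_sizes_bytes):
    """Return True if the file sizes satisfy the partitioning rule."""
    total = len(file_sizes_bytes)
    if total < MIN_FILES:
        return True
    large = sum(1 for size in file_sizes_bytes if size > SIZE_THRESHOLD_BYTES)
    # At least 90% of files must exceed the threshold; integer math avoids
    # floating-point rounding at the boundary.
    return large * 10 >= total * 9


mb = 1024 * 1024
print(partition_check_passes([1 * mb] * 10))                   # True: fewer than 50 files
print(partition_check_passes([128 * mb] * 90 + [1 * mb] * 10))  # True: exactly 90% are large
print(partition_check_passes([128 * mb] * 89 + [1 * mb] * 11))  # False: only 89% are large
```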

Row count

Checks the total number of rows in the dataset.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Row count | Total number of rows in a dataset | Between 500 and 1000, Greater than or equal to 100, Less than or equal to 1000, Equal to 10 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median row count in recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

If the row count check is set against the last successful check result, the check will evaluate the criteria according to the row count recorded in the previous passing check, and will not consider the results in failed checks.

Transaction file count

Checks the total number of files committed in one transaction, excluding log files.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| File count | Total number of files committed in a transaction | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median number of files in recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Transaction file size

Checks the total size of the files committed in one transaction, excluding log files.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| File size | Total size of all files committed in a transaction (in MB or KB) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median file size in recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Content checks

Allowed column values

Checks if the values in a column match a list of allowed values.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Column name to check against | FIRST_NAME | Y |
| Allowed values | Allowed possible values for above column | John, Jane | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Approximate unique percentage

Checks what percentage of values in a column are unique. The percentage is approximate, so this check is not suitable for checking whether a column is a primary key (100% unique values); use the primary key check instead.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Column name to check against | FIRST_NAME | Y |
| Unique percentage | Values that are unique in the column (in %) | Between 10 and 20, Greater than or equal to 50, Less than or equal to 50, Equal to 1 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Column regex

Checks if the values in a column match a certain regular expression.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Column name to check | FIRST_NAME | Y |
| Regex | Regular expression the column should match | ^Pre, post$, .*any.* | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |
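
To get a feel for the three example patterns, the sketch below tries them against sample values using Python's `re.search`, which matches anywhere in the string unless the pattern is anchored. Whether the check itself uses partial or full matching, and which regex dialect it uses, is not stated on this page, so treat this only as a way to reason about the patterns. The `all_match` helper and the sample values are hypothetical.

```python
import re


def all_match(pattern, values):
    """True if every value contains a match for the pattern (partial match assumed)."""
    compiled = re.compile(pattern)
    return all(compiled.search(value) is not None for value in values)


print(all_match(r"^Pre", ["Prefix", "Preview"]))    # True: both start with "Pre"
print(all_match(r"post$", ["outpost", "poster"]))   # False: "poster" does not end with "post"
print(all_match(r".*any.*", ["company", "anyone"]))  # True: both contain "any"
```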

Approximate column relation

This check provides an estimate of similarity between two columns as a percentage. For an exact check, use data expectations instead.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Other dataset | Dataset to check against | /Users/John Appleseed/Stock_Prices_Latest | Y |
| Column 1 name | Column name of the dataset on which the check is set | FIRST_NAME | Y |
| Column 2 name | Column name of the other dataset | f_name | Y |
| Percentage match | To what extent the two columns must match (in %) | 85% of values are equal | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Date range

Checks for the range of values in a date column.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the column to check | LAST_UPDATED | Y |
| Allowed date range | Allowed date range for the column | 2017-01-01 – 2018-01-01 | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Null percentage

Checks what percentage of values in a column are null.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the column to check | CUSTOMER_ID | Y |
| Null percentage | Percentage of values that are null in the column (in %) | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Median deviation | Difference (in approximate standard deviations) from the median null percentage of recent builds | 1 Standard deviations, 10 Recent builds | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Numeric mean

Checks whether the average of a numeric column meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the numeric column to check | NUM_FAILURES | Y |
| Mean | Desired mean of the column | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Difference from last check | Compare the current mean of the column to the mean of the column at the last check run, ± an optional constant | Greater than the last check + 5 | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Numeric median

Checks whether the median of a numeric column meets some threshold.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the numeric column to check | NUM_FAILURES | Y |
| Median | Desired median of the column | Between 1 and 2, Greater than or equal to 1, Less than or equal to 1, Equal to 1 | N |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Difference from last check | Compare the current median of the column to the median of the column at the last check run, ± an optional constant | Greater than the last check + 5 | N |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Numeric range

Checks the range of values in a numeric column.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the numeric column to check | NUM_FAILURES | Y |
| Allowed range | Allowed range for the column | 3-5 | Y |
| Severity | Severity of check failure | Moderate, Critical | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Primary key

Checks that the values in a column are 100% unique and non-null.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the column to check | PART_ID | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |
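
The two conditions in this check (every value unique, no value null) can be expressed in a few lines. This is only a sketch of the stated conditions over an in-memory list; the function name and sample values are hypothetical, and the platform presumably evaluates the check at dataset scale rather than this way.

```python
def is_primary_key(values):
    """True if the column values are all non-null and all distinct."""
    if any(value is None for value in values):
        return False
    return len(set(values)) == len(values)


print(is_primary_key(["P-1", "P-2", "P-3"]))   # True
print(is_primary_key(["P-1", "P-2", "P-1"]))   # False: duplicate value
print(is_primary_key(["P-1", None, "P-3"]))    # False: null value
```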

Schema checks

Column

Checks for the existence and type of a column.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column name | Name of the column to check for | PART_ID | Y |
| Is Present | Check existence of column | Y | Y |
| Type | Type of the column | Integer | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Column count

Checks for the total number of columns in the dataset.

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Column count | Total number of columns in the dataset | 50 | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

Schema

Checks the dataset schema against the expected columns, using the chosen comparison type (see below for the available types).

| Rule component | Description | Example options | Required? |
| --- | --- | --- | --- |
| Columns | The dataset columns and their types; each can be checked by full type match or by column existence only | Type: String | Y |
| Comparison type | Specify which comparison policy will be used | Text | Y |
| Notes | Add a note to provide additional context | Text | N |
| Issues | Automatically create an issue when this check fails | Y, N | N |

The available comparison types are:

| Value | Comparison allowance |
| --- | --- |
| EXACT_MATCH_ORDERED_COLUMNS | Checks column order, names and types, and number of columns. |
| EXACT_MATCH_UNORDERED_COLUMNS | Checks column names and types, and number of columns. Order does not matter. |
| COLUMN_ADDITIONS_ALLOWED | Checks column names and types. Extra columns are allowed, but columns cannot be missing. |
| COLUMN_ADDITIONS_ALLOWED_STRICT | Like COLUMN_ADDITIONS_ALLOWED; however, whenever a new column is added to the dataset, that column is added to the check. Added columns cannot be missing thereafter. |
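
One way to read the four comparison types is sketched below, with a schema modeled as a list of `(name, type)` pairs. The function names and the sample schema are hypothetical, and the sketch covers full type matching only (not the "column existence only" option). It is an interpretation of the descriptions above, not the platform's implementation.

```python
def schema_matches(expected, actual, comparison_type):
    """Compare two schemas, each a list of (column_name, column_type) pairs."""
    if comparison_type == "EXACT_MATCH_ORDERED_COLUMNS":
        return list(expected) == list(actual)
    if comparison_type == "EXACT_MATCH_UNORDERED_COLUMNS":
        return len(expected) == len(actual) and set(expected) == set(actual)
    if comparison_type in ("COLUMN_ADDITIONS_ALLOWED",
                           "COLUMN_ADDITIONS_ALLOWED_STRICT"):
        # Extra columns are fine, but every expected column must be present.
        return set(expected) <= set(actual)
    raise ValueError(f"Unknown comparison type: {comparison_type}")


def updated_expectation(expected, actual, comparison_type):
    """For the STRICT variant, fold newly seen columns into the expectation."""
    if comparison_type != "COLUMN_ADDITIONS_ALLOWED_STRICT":
        return list(expected)
    known = set(expected)
    return list(expected) + [column for column in actual if column not in known]


expected = [("id", "Integer"), ("name", "String")]
reordered = [("name", "String"), ("id", "Integer")]
with_extra = expected + [("age", "Integer")]

print(schema_matches(expected, reordered, "EXACT_MATCH_ORDERED_COLUMNS"))     # False
print(schema_matches(expected, reordered, "EXACT_MATCH_UNORDERED_COLUMNS"))   # True
print(schema_matches(expected, with_extra, "COLUMN_ADDITIONS_ALLOWED"))       # True
print(updated_expectation(expected, with_extra, "COLUMN_ADDITIONS_ALLOWED_STRICT"))
```

Under the strict variant, the last call returns an expectation that now includes `age`, so a later schema without `age` would fail.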

Approximate standard deviation

Since dataset builds can easily have outliers, we do not use the true standard deviation. Instead, we use the median absolute deviation (MAD) which is a more robust measure of variability.

The MAD is defined as the median of the absolute deviations from the median of the data. For values x_1, ..., x_n with median X this means MAD = median(|x_i - X|).

The median absolute deviation can be used to approximate standard deviation by multiplying with a constant.

Our calculation is σ = MAD * 1.4826.
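
The calculation above can be reproduced with Python's standard library. The sketch below computes the MAD, scales it by 1.4826, and expresses a new value's distance from the median in approximate standard deviations. The function names and the sample history are hypothetical, and how the platform handles a MAD of zero is not documented here; the sketch simply returns infinity for any non-median value in that case.

```python
from statistics import median

MAD_TO_SIGMA = 1.4826  # constant from the formula above


def approximate_std(values):
    """sigma = MAD * 1.4826, where MAD = median(|x_i - median(x)|)."""
    center = median(values)
    mad = median([abs(x - center) for x in values])
    return MAD_TO_SIGMA * mad


def deviations_from_median(value, history):
    """Distance of `value` from the median of `history`, in approximate sigmas."""
    sigma = approximate_std(history)
    distance = abs(value - median(history))
    if sigma == 0:
        # Assumption: with no spread, any change counts as infinitely far away.
        return 0.0 if distance == 0 else float("inf")
    return distance / sigma


# Hypothetical build durations in minutes, with one outlier.
history = [10, 12, 11, 13, 100]

print(approximate_std(history))             # 1.4826 (median 12, MAD 1)
print(deviations_from_median(15, history))  # about 2.02
```

The outlier of 100 barely affects the result, which is the robustness property that motivates using the MAD instead of the true standard deviation.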

For detailed information, see Median Absolute Deviation - Wikipedia.

