Types of checks
This page outlines various types of checks available in Data Health, including job-level checks, build-level checks, schedule-level checks, and freshness checks.
Job-level checks vs. schedule-level checks
This section describes the job status, schedule status, and schedule duration checks.
The definitions below clarify what a job and a schedule are in Foundry:
- Job: a Spark computation defined by the logic in a single transform. In other words, a job is a single transform that produces a single dataset (or several if a multi-output transform is used). Jobs are broken down into a set of stages.
- Schedule: a collection of jobs with defined target datasets that are configured to run on a recurring basis. Schedules can be used to represent a subset of a pipeline that builds together.
We use the following Data Health checks to ensure jobs and schedules are running successfully:
- Job status: Triggered whenever the dataset on which the check is installed is refreshed or is created as part of any build. A job status check succeeds if the target dataset builds successfully, even if the build it is part of fails downstream. However, if the build fails upstream of the target dataset, the target dataset registers as having a cancelled build and the job status check is not evaluated for it.
- Schedule duration & schedule status: These allow you to monitor the status of a schedule build, including all intermediates. Note that these checks are only triggered when the schedule configured on the check runs.
- In general, we recommend that all schedules have schedule status checks. If you already have a schedule status check, installing job status checks on other datasets built by the same schedule is not recommended, because any job failing on the schedule already triggers the schedule status check.
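The difference between the two status checks can be sketched in a few lines of Python. This is a simplified model for illustration only, not Data Health's implementation: the `JobOutcome` enum, both function names, and the dataset names `raw`, `clean`, and `report` are hypothetical, and the schedule status rule (pass only if every job succeeded) is an assumption drawn from the description above.

```python
from enum import Enum
from typing import Dict, Optional


class JobOutcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"  # e.g. an upstream job in the same build failed


def job_status_check(outcome: JobOutcome) -> Optional[bool]:
    """Return True (pass), False (fail), or None when the check is not evaluated."""
    if outcome is JobOutcome.CANCELLED:
        return None  # cancelled builds are not evaluated
    return outcome is JobOutcome.SUCCEEDED


def schedule_status_check(outcomes: Dict[str, JobOutcome]) -> bool:
    """Assumed rule: pass only if every job in the schedule run succeeded."""
    return all(o is JobOutcome.SUCCEEDED for o in outcomes.values())


# Hypothetical pipeline raw -> clean -> report, where the last job fails.
downstream_failure = {
    "raw": JobOutcome.SUCCEEDED,
    "clean": JobOutcome.SUCCEEDED,
    "report": JobOutcome.FAILED,
}
# Same pipeline, but the first job fails and the rest are cancelled.
upstream_failure = {
    "raw": JobOutcome.FAILED,
    "clean": JobOutcome.CANCELLED,
    "report": JobOutcome.CANCELLED,
}

clean_after_downstream_failure = job_status_check(downstream_failure["clean"])
clean_after_upstream_failure = job_status_check(upstream_failure["clean"])
schedule_after_downstream_failure = schedule_status_check(downstream_failure)
```

In this model, a job status check installed on `clean` passes when only `report` fails, and is not evaluated at all when `raw` fails, while the schedule status check fails in both cases. That is why a single schedule status check usually makes per-dataset job status checks on the same schedule redundant.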
When trying to determine when and where to place job or schedule status checks, see our guide on which health checks to apply.
For more details and further clarification on the checks themselves, see the checks reference for schedule status and job status.
Freshness checks
This section describes the sync freshness, data freshness, and time since last updated checks.
All three of these checks are concerned with “freshness” (how up-to-date some aspect of your data is), but each uses a different method to evaluate it:
- Time since last updated: Evaluates freshness of the dataset. This check calculates how much time has elapsed between the current time and the last transaction committed, even if the transaction was empty; an empty transaction does not change the data in the dataset.
- Data freshness: Evaluates freshness of the data in the dataset by calculating how much time has elapsed between the last transaction committed and the maximum value of a timestamp column. This check is only run when a transaction is committed.
- Sync freshness: Evaluates freshness of the data in the synced dataset by calculating how much time has elapsed between the time of the latest sync of a dataset and the maximum value of a datetime column.
For both data and sync freshness, it is ideal if the timestamp in the column represents the time when the row was added in the source system.
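The three calculations can be written out as plain time differences. The sketch below only illustrates the arithmetic described above; the function names and example timestamps are hypothetical, and it does not call any Data Health API.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def time_since_last_updated(now: datetime, last_commit: datetime) -> timedelta:
    """Current time minus the last committed transaction (empty or not)."""
    return now - last_commit


def data_freshness(last_commit: datetime, timestamps: Iterable[datetime]) -> timedelta:
    """Last committed transaction minus the maximum value of the timestamp column."""
    return last_commit - max(timestamps)


def sync_freshness(latest_sync: datetime, timestamps: Iterable[datetime]) -> timedelta:
    """Time of the latest sync minus the maximum value of the datetime column."""
    return latest_sync - max(timestamps)


# Hypothetical values: the newest row was created at 07:30 in the source system,
# the last transaction (and sync) landed at 10:00, and it is now 12:00.
utc = timezone.utc
column = [
    datetime(2024, 1, 1, 6, 0, tzinfo=utc),
    datetime(2024, 1, 1, 7, 30, tzinfo=utc),
]
last_commit = datetime(2024, 1, 1, 10, 0, tzinfo=utc)
now = datetime(2024, 1, 1, 12, 0, tzinfo=utc)

since_update = time_since_last_updated(now, last_commit)  # 2 hours
row_age_at_commit = data_freshness(last_commit, column)   # 2 hours 30 minutes
row_age_at_sync = sync_freshness(last_commit, column)     # 2 hours 30 minutes
```

Note that time since last updated keeps growing as the clock advances, whereas data freshness and sync freshness only change when a new transaction or sync arrives, because both are anchored to that event and not to the current time.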
When trying to determine when and where to place freshness checks, see our guide on what health checks to apply.
For more details on the checks themselves, see the checks reference for time since last updated, data freshness, and sync freshness.
Can I abort builds when health checks fail?
Most standard health checks can only be computed after a job has finished. If your dataset is created in a Code Repository, you can use data expectations to define checks that run at build time. This allows you to abort the build on error and monitor the checks using Data Health.
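To illustrate why a build-time check can abort a build while a post-build check cannot, here is a minimal plain-Python model. It deliberately does not use the data expectations API: `build_dataset`, `ExpectationFailed`, and the `id is never null` expectation are hypothetical names, and the `committed` list stands in for the output dataset.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ExpectationFailed(Exception):
    """Raised to abort the simulated build before any output is committed."""


def build_dataset(
    rows: List[Row],
    expectations: Dict[str, Callable[[List[Row]], bool]],
    committed: List[Row],
) -> None:
    # Evaluate every expectation during the build, before committing output.
    for name, holds in expectations.items():
        if not holds(rows):
            raise ExpectationFailed(f"expectation failed: {name}")
    committed.extend(rows)  # reached only if all expectations hold


expectations = {
    "id is never null": lambda rows: all(r["id"] is not None for r in rows),
}

committed: List[Row] = []
build_dataset([{"id": 1}, {"id": 2}], expectations, committed)  # commits two rows

try:
    build_dataset([{"id": None}], expectations, committed)
    aborted = False
except ExpectationFailed:
    aborted = True  # the bad rows never reach the committed output
```

Because the check runs before the commit step, a failure leaves the previously committed data untouched. A check that runs only after the job finishes can report the problem but cannot prevent the bad output from being written.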