
Define data expectations

Data Expectations are a set of requirements defined in code on dataset, virtual table, and Iceberg table inputs or outputs. These requirements, or "expectations," can be used to create checks that improve data pipeline stability. If a Data Expectations check fails during a dataset build, the build can be aborted automatically to save time and resources and to avoid issues in downstream data. Data Expectations are integrated with Data Health for monitoring.

To get started, view the guide in the Python Transforms documentation, or see the reference for all available expectations.

Benefits of using Data Expectations

  • Pipeline protection - Because expectations run as part of the build, you can abort a build when an expectation fails, which prevents failures and bad data from propagating to downstream resources.
  • Change management - Because expectations are defined in Code Repositories, changing them requires the same pull-request review process that is set on your protected branches.
  • Proactive testing - Anticipate check failures before merging changes to protected branches by building them on a development branch.

Terminology

  • Expectation - A strongly typed requirement on the data structure or content (e.g. column is not null)
  • Check - A meaningful expectation (can be a composite of multiple expectations) that is connected to a single dataset (output or input) in a transform. A check has a name that will be used when identifying and monitoring it (e.g. “object schema validation”).
  • Pre-condition - A check that is assigned to an input of a transform, typically to validate essential assumptions about the input's structure or content before proceeding with the build.
  • Post-condition - A check that is assigned to the output of the transform, typically to guarantee dataset SLAs are maintained and downstream dependencies are protected.
  • Check result - Produced when the check runs (during the build). It contains the overall result of the check and a breakdown of the results of its individual expectations. The check result can be monitored in Data Health.
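The relationship between these terms can be sketched in plain Python. This is only an illustrative model, not the Foundry transforms API: the class names `Expectation`, `Check`, and `CheckResult` and the helpers `not_null` and `run_check` are hypothetical and were chosen to mirror the terminology above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Expectation:
    """A single requirement on the data, e.g. "column is not null"."""
    description: str
    predicate: Callable[[List[Row]], bool]


@dataclass(frozen=True)
class Check:
    """A named group of one or more expectations attached to one dataset."""
    name: str
    expectations: List[Expectation]


@dataclass(frozen=True)
class CheckResult:
    """Outcome of running a check, with a per-expectation breakdown."""
    check_name: str
    passed: bool
    breakdown: Dict[str, bool]


def not_null(column: str) -> Expectation:
    # Hypothetical helper: every row must have a non-None value in `column`.
    return Expectation(
        f"{column} is not null",
        lambda rows: all(row.get(column) is not None for row in rows),
    )


def run_check(check: Check, rows: List[Row]) -> CheckResult:
    # Evaluate each expectation; the check passes only if all of them pass.
    breakdown = {e.description: e.predicate(rows) for e in check.expectations}
    return CheckResult(check.name, all(breakdown.values()), breakdown)


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": None}]
check = Check("object schema validation", [not_null("id"), not_null("name")])
result = run_check(check, rows)
```

Here `result.passed` is `False` because one of the two expectations in the composite check is not met, and `result.breakdown` records which one.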

How does it work?

Define

Data Expectations are defined on the dataset transform in the relevant Code Repository. Checks can be applied on the transform inputs and outputs (see the guide for details). The check name must be unique in a single transform.

Alongside its expectations, a check defines how failures are handled at build time. When a check fails, the build can either be aborted or allowed to continue with a warning.
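The two failure behaviors can be illustrated with a small, self-contained model. This is a sketch under assumptions rather than the platform's implementation: `OnError`, `BuildAborted`, and `apply_check` are hypothetical names, and the warning text is invented for the example.

```python
from enum import Enum
from typing import List


class OnError(Enum):
    FAIL = "FAIL"  # abort the build when the check fails
    WARN = "WARN"  # let the build continue and record a warning


class BuildAborted(Exception):
    """Models a job whose status becomes "Aborted"."""


def apply_check(name: str, passed: bool, on_error: OnError, warning_log: List[str]) -> None:
    # A passing check never affects the build.
    if passed:
        return
    # A failing check either aborts the build or only logs a warning.
    if on_error is OnError.FAIL:
        raise BuildAborted(f"check '{name}' failed")
    warning_log.append(f"check '{name}' failed, build continues")


warning_log: List[str] = []

# A failing WARN check is recorded, and execution continues.
apply_check("valid age", passed=False, on_error=OnError.WARN, warning_log=warning_log)

# A failing FAIL check stops the build.
try:
    apply_check("primary key", passed=False, on_error=OnError.FAIL, warning_log=warning_log)
    aborted = False
except BuildAborted:
    aborted = True
```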

The check is registered during CI on the relevant branch. Changing the expectations on protected branches will require a pull-request just like any other code change.

:::callout{theme="neutral"} When making changes to protected branches, it is recommended to build the dataset on the development branch to ensure your Data Expectations are met before merging changes to the default branch. :::

Run

The registered checks will run as part of the build job. Failure to meet data expectations will be highlighted in the Builds application and in the dataset History tab. If the check definition indicates FAIL on error, the job status will change to “Aborted” with an appropriate error. In the Job timeline you can find the “Expectations” indicator; clicking on the indicator will show the check results and breakdown of the different expectations.

:::callout{theme="neutral"} When a pre-condition fails, the build of the transform's output is aborted (rather than the build of the input on which the pre-condition was defined). To abort builds of an input dataset, the Data Expectation must be defined as a post-condition on the transform that produces that input dataset. :::
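To make the scope of pre-conditions and post-conditions concrete, consider a toy two-step pipeline in which an upstream transform writes dataset A and a downstream transform reads A and writes B. The function `build` below is a hypothetical simulation, not platform code. It also assumes that B is not built when the build of A is aborted.

```python
from typing import Callable, List, Optional

Predicate = Callable[[List[dict]], bool]


def build(
    rows_for_a: List[dict],
    precondition_on_a: Optional[Predicate] = None,
    postcondition_on_a: Optional[Predicate] = None,
) -> List[str]:
    """Return the names of the datasets that were actually built."""
    built: List[str] = []

    # Upstream transform: a failing post-condition on A aborts A itself.
    if postcondition_on_a is not None and not postcondition_on_a(rows_for_a):
        return built
    built.append("A")

    # Downstream transform: a failing pre-condition on A aborts only B.
    if precondition_on_a is not None and not precondition_on_a(rows_for_a):
        return built
    built.append("B")
    return built


def has_rows(rows: List[dict]) -> bool:
    return len(rows) > 0


# The same expectation, attached in two different places, on empty data.
as_precondition = build([], precondition_on_a=has_rows)
as_postcondition = build([], postcondition_on_a=has_rows)
```

With the expectation attached as a pre-condition, A is still written and only B is aborted. Attached as a post-condition on the upstream transform, A is never written.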

Monitor

Each check run produces a result that is reported to Data Health. The most recent Data Expectations results are shown in the Health tab of the Dataset Preview application, where notifications and issue triggers can be set (as with other Data Health checks).

:::callout{theme="neutral"} Remember that checks on a dataset are uniquely identified by their name. The history of a check, as well as its individual monitoring settings, will remain only as long as its name doesn’t change. Changing the name of a check is equivalent to removing the old check and creating a new one in its place. :::
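The effect of renaming a check can be pictured as a history store keyed by dataset and check name. This is a minimal sketch with assumed names: the `history` mapping, the `report` function, and the dataset path are hypothetical and do not describe how Data Health stores results.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

# History is keyed by (dataset, check name), so the name is part of the identity.
history: DefaultDict[Tuple[str, str], List[bool]] = defaultdict(list)


def report(dataset: str, check_name: str, passed: bool) -> None:
    history[(dataset, check_name)].append(passed)


report("/data/orders", "pk check", True)
report("/data/orders", "pk check", False)

# After the check is renamed in code, its results land under a new key.
report("/data/orders", "primary key check", True)

old_history = history[("/data/orders", "pk check")]
new_history = history[("/data/orders", "primary key check")]
```

The renamed check starts with a single result, and the two earlier results stay attached to the old name.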

Incremental

All checks run on full datasets, regardless of the incremental nature of the transform.

:::callout{theme="neutral"} For example, assume there is a primary key check on the output of a transform that runs incrementally. Because Data Expectations checks always run on the full dataset, the check fails if the new transaction (which is about to be written incrementally) contains a primary key value that was already written in a previous transaction. :::
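The primary key scenario can be reproduced with a few lines of plain Python. This is a simplified sketch: `primary_key_holds` is a hypothetical helper that only tests uniqueness, and the row values are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def primary_key_holds(rows: List[Dict[str, int]], key: str) -> bool:
    # Simplified primary key expectation: every key value appears exactly once.
    counts = Counter(row[key] for row in rows)
    return all(n == 1 for n in counts.values())


previous_transactions = [{"id": 1}, {"id": 2}]  # already written
new_transaction = [{"id": 2}, {"id": 3}]        # about to be written incrementally

# The new rows are unique among themselves...
passes_on_increment = primary_key_holds(new_transaction, "id")

# ...but the check is evaluated on the full dataset, where id 2 appears twice.
passes_on_full_dataset = primary_key_holds(previous_transactions + new_transaction, "id")
```

`passes_on_increment` is `True` while `passes_on_full_dataset` is `False`, which is why a key that looks valid within the new transaction can still fail the check.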

