Core concepts

This page provides an introduction to the core concepts of Foundry data integration that are relevant to Pipeline Builder.

Build

Datasets, branches, transforms, and outputs are fundamental to Pipeline Builder. We recommend reviewing these topics before building your first pipeline, and again as you transform your data and work toward pipeline outputs.

Datasets

Datasets are the building blocks of a pipeline. In the data integration process, data is represented as Foundry datasets from the moment it lands in Foundry until it is mapped into the Ontology object model.

Fundamentally, a Foundry dataset is a wrapper around a collection of files which are stored in a backing file system. Pipeline Builder is primarily intended for structured data but can also be used for semi-structured data.

Learn more about input datasets in Pipeline Builder.

Branches

Version control is crucial to maintaining healthy pipeline workflows. In Pipeline Builder, version control is implemented with pipeline branches, which operate similarly to code branches in Git version control.

A pipeline branch is a copy of the pipeline on which a user can iterate without saving changes back to the main pipeline. Users can make changes, preview, save, and build on their branch. Once they are satisfied with the changes, they can propose merging the branch back into the Main branch, similar to a pull request in a Git workflow.
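The copy-then-merge idea can be illustrated with a small Python sketch. This is only a conceptual model and not how Pipeline Builder stores or merges pipelines: the dictionary layout, the step names, and the `merge` helper are all hypothetical, and the merge shown here simply overwrites matching steps without handling conflicts or deletions.

```python
import copy

# Hypothetical pipeline definition: step name -> description of the step.
main = {
    "clean": "drop rows with a null id",
    "aggregate": "sum delay by carrier",
}

# Creating a branch: an independent copy, so edits never touch main.
feature = copy.deepcopy(main)
feature["aggregate"] = "average delay by carrier"

# While iterating on the branch, main still holds the original step.
main_before_merge = copy.deepcopy(main)


def merge(target: dict, source: dict) -> dict:
    """Return a new definition where steps from `source` override `target`.

    Simplification: no conflict detection and no handling of deleted steps.
    """
    merged = copy.deepcopy(target)
    merged.update(source)
    return merged


# Accepting the proposal: the branch's changes become part of main.
main = merge(main, feature)
print(main["aggregate"])
```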

Learn more about branches in Pipeline Builder.

Transforms

A transform can be thought of as a function definition; that is, a transform accepts a set of inputs (such as datasets) and produces a set of outputs. A pipeline is a linkage of datasets, data expectations, and targeted data outputs that are connected by transforms.
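To make the function analogy concrete, here is a minimal Python sketch in which each transform is a plain function from rows to rows and a pipeline is a chain of such functions. Pipeline Builder itself is configured through its interface rather than through code like this; the names `filter_delayed`, `add_delay_hours`, and `run_pipeline`, and the sample flight data, are assumptions made purely for illustration.

```python
from typing import Callable

Row = dict[str, object]
Dataset = list[Row]
Transform = Callable[[Dataset], Dataset]


def filter_delayed(rows: Dataset) -> Dataset:
    """Keep only rows delayed by more than 15 minutes."""
    return [r for r in rows if r["delay_minutes"] > 15]


def add_delay_hours(rows: Dataset) -> Dataset:
    """Add a derived column without mutating the input rows."""
    return [{**r, "delay_hours": r["delay_minutes"] / 60} for r in rows]


def run_pipeline(source: Dataset, transforms: list[Transform]) -> Dataset:
    """Apply each transform in order; each output feeds the next input."""
    result = source
    for transform in transforms:
        result = transform(result)
    return result


flights = [
    {"flight_id": "A1", "delay_minutes": 5},
    {"flight_id": "B2", "delay_minutes": 90},
]
output = run_pipeline(flights, [filter_delayed, add_delay_hours])
print(output)
```

Because each function returns new rows instead of modifying its input, the source data stays intact, which mirrors the idea that a transform produces outputs from inputs rather than editing them in place.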

Learn more about transforms in Pipeline Builder.

Pipeline outputs

Outputs in Pipeline Builder are the result of transforms performed in the pipeline and can be datasets, virtual tables, or Ontology components such as object types, object link types, or time series. Outputs can be used in other Foundry applications such as Quiver or Code Workbook.

Learn more about pipeline outputs in Pipeline Builder.

Manage

Schedules and data expectations are useful for maintaining healthy, stable pipelines. We recommend learning more about these topics once you have built your first pipeline.

Schedules

Schedules are used to run dataset builds on a recurring basis to keep data flowing through Foundry consistently. In Pipeline Builder, builds can be scheduled at a specific time, on a specific cadence, or based on the status of a parent resource; for example, you can set a build to occur when an upstream dataset is updated.
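The three trigger styles described above (a specific time, a cadence, and the status of a parent resource) can be modeled with a short Python sketch. This is an illustration of the decision logic only, not the Foundry scheduling API; `ScheduleSketch`, `should_build`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScheduleSketch:
    """Hypothetical schedule with one field per trigger style."""
    run_at: Optional[datetime] = None    # build at a specific time
    every: Optional[timedelta] = None    # build on a fixed cadence
    on_upstream_update: bool = False     # build when a parent dataset updates


def should_build(
    schedule: ScheduleSketch,
    now: datetime,
    last_build: datetime,
    upstream_updated_at: Optional[datetime] = None,
) -> bool:
    """Return True if any configured trigger has fired since the last build."""
    if schedule.run_at is not None and last_build < schedule.run_at <= now:
        return True
    if schedule.every is not None and now - last_build >= schedule.every:
        return True
    if (
        schedule.on_upstream_update
        and upstream_updated_at is not None
        and upstream_updated_at > last_build
    ):
        return True
    return False


last_build = datetime(2024, 1, 1, 0, 0)
now = datetime(2024, 1, 1, 6, 0)

# Cadence trigger: six hours have passed, and the cadence is four hours.
print(should_build(ScheduleSketch(every=timedelta(hours=4)), now, last_build))
```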

Learn more about schedules in Pipeline Builder.

Data expectations

Data expectations are requirements that can be applied to dataset outputs. These requirements (known as "expectations") can be used to create checks that improve data pipeline stability. Pipeline Builder supports data expectations on outputs and intermediate transforms through unit tests.

Data expectations can be set on each pipeline output. Pipeline Builder currently supports two data expectation types: primary key and row count.

If any expectation fails, the build fails. The job expectations pane shows which data expectations passed and which failed.
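The behavior of the two expectation types can be sketched in Python. The exact semantics here are assumptions: the sketch treats a primary key expectation as "every key value is present and unique" and a row count expectation as "the number of rows is within given bounds". The function names and the `ExpectationFailed` exception are hypothetical and are not part of Pipeline Builder.

```python
from typing import Optional


class ExpectationFailed(Exception):
    """Raised when at least one expectation does not hold (hypothetical)."""


def check_primary_key(rows: list[dict], key: str) -> bool:
    """Assumed meaning: every row has a non-null key and no key repeats."""
    values = [row.get(key) for row in rows]
    return None not in values and len(values) == len(set(values))


def check_row_count(rows: list[dict], minimum: int, maximum: Optional[int] = None) -> bool:
    """Assumed meaning: the row count lies within [minimum, maximum]."""
    count = len(rows)
    return count >= minimum and (maximum is None or count <= maximum)


def build_output(rows: list[dict], key: str, min_rows: int) -> dict[str, bool]:
    """Evaluate all expectations; fail the build if any of them fails."""
    results = {
        "primary_key": check_primary_key(rows, key),
        "row_count": check_row_count(rows, min_rows),
    }
    failed = [name for name, passed in results.items() if not passed]
    if failed:
        raise ExpectationFailed(f"failed expectations: {failed}")
    return results


rows = [{"id": 1, "carrier": "AA"}, {"id": 2, "carrier": "BB"}]
print(build_output(rows, key="id", min_rows=1))
```

Evaluating every expectation before raising, instead of stopping at the first failure, matches the idea of a pane that reports which expectations passed and which failed.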

Learn more about data expectations in Pipeline Builder.

