
What is a data pipeline?

The overall goal of data integration in Foundry is to provide a digital view of the objective reality within your Organization. Achieving this goal typically requires syncing data from many source systems, imposing a common schema, combining datasets, and enabling teams to build use cases on a common data foundation.

Within this context, the term "data pipeline" is widely used to refer to the flow of data from a source system through intermediate datasets to ultimately produce high-quality, curated datasets that can be structured into the Ontology or serve as the foundation of machine learning and analytical workflows.
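As a rough illustration of that flow, the sketch below models a pipeline as a chain of plain Python functions, each taking one dataset (here, a list of rows) and producing the next. The rows, column names, and function names are hypothetical, and this is not Foundry's transforms API; it only shows the raw-to-intermediate-to-curated shape described above.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical rows as they might arrive from a source system.
RAW_ORDERS: List[Row] = [
    {"order_id": "1", "region": " emea ", "amount": "10.50"},
    {"order_id": "2", "region": "EMEA", "amount": "4.50"},
    {"order_id": "3", "region": "apac", "amount": "7.00"},
]


def clean_orders(raw: List[Row]) -> List[Row]:
    """Intermediate dataset: impose types and a consistent format."""
    return [
        {
            "order_id": int(str(r["order_id"])),
            "region": str(r["region"]).strip().upper(),
            "amount": float(str(r["amount"])),
        }
        for r in raw
    ]


def revenue_by_region(clean: List[Row]) -> Dict[str, float]:
    """Curated dataset: one aggregated value per region."""
    totals: Dict[str, float] = {}
    for r in clean:
        region = str(r["region"])
        totals[region] = totals.get(region, 0.0) + float(str(r["amount"]))
    return totals


# Source rows -> intermediate dataset -> curated dataset.
curated = revenue_by_region(clean_orders(RAW_ORDERS))
```

Each function depends only on the dataset before it, which is what makes the chain easy to rerun on a schedule and to inspect step by step.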

Although any two datasets in Foundry that are connected together via transformation logic could be considered a pipeline, in practice what we refer to as a "data pipeline" is more constrained. Typically, a pipeline has a notion of ownership—a person or group of people who oversees the pipeline to ensure that data flows through it regularly and reliably to power business processes.

Beyond ownership, several other characteristics are associated with a high-quality, production-ready data pipeline. The rest of this document explores them in turn: pipeline setup, build scheduling, data quality, security and governance, and support processes and documentation. Each section also points to additional resources.

In addition to features common to all pipelines, consider which type of pipeline should be created for your data foundation based on factors such as data scale, latency requirements, and maintenance complexity. There are three primary types of pipelines available in Foundry: batch, incremental, and streaming. Learn more about the types of pipelines.
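The difference between batch and incremental processing can be sketched in plain Python. The assumption here is the usual one: a batch run recomputes its output from the full input every time, while an incremental run processes only the rows added since the previous run and folds them into the prior output. The function and variable names are illustrative rather than Foundry APIs, and streaming is not shown.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (key, value); hypothetical input rows


def batch_totals(all_events: List[Event]) -> Dict[str, int]:
    """Batch: recompute the output from the full input on every run."""
    totals: Dict[str, int] = {}
    for key, value in all_events:
        totals[key] = totals.get(key, 0) + value
    return totals


def incremental_totals(
    previous: Dict[str, int], new_events: List[Event]
) -> Dict[str, int]:
    """Incremental: update the prior output using only newly added rows."""
    totals = dict(previous)
    for key, value in new_events:
        totals[key] = totals.get(key, 0) + value
    return totals


day_one = [("a", 1), ("b", 2)]
day_two = [("a", 5)]

# With append-only input both approaches agree; the incremental run
# simply reads fewer rows on day two.
batch_result = batch_totals(day_one + day_two)
incremental_result = incremental_totals(batch_totals(day_one), day_two)
```

The incremental version does less work per run, but in this sketch it stays correct only while earlier rows never change, which is one source of the extra maintenance complexity.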

Pipeline setup

Foundry's Pipeline Builder enables users to quickly and easily set up a pipeline with a streamlined point-and-click interface. With Pipeline Builder, users gain the benefits of Git-style change management, data health checks, multi-modal security, and fine-grained data auditing.

Technical users can build and maintain pipelines more rapidly than ever before, focusing on declarative descriptions of their end-to-end pipelines and desired outputs. Additionally, Pipeline Builder's point-and-click, form-based interface enables less technical users to create pipelines through a simplified approach.

Build scheduling

Simply put, a series of data transformations must run regularly in order to be considered a data pipeline. Defining build schedules in Foundry is a basic step in building a pipeline, as downstream data consumers expect data to update regularly. The frequency with which data flows through a pipeline is subject to organizational requirements: some pipelines may run only weekly or daily, while others run on an hourly or even more frequent basis.
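One way to make an update expectation concrete is a staleness check: given how often a dataset is expected to build, decide whether its last successful build is overdue. The sketch below is plain Python with hypothetical names and timestamps; it does not show how Foundry schedules are configured.

```python
from datetime import datetime, timedelta


def is_stale(last_built: datetime, expected_every: timedelta, now: datetime) -> bool:
    """True if more than one expected interval has passed since the last build."""
    return now - last_built > expected_every


now = datetime(2024, 1, 2, 12, 0)
last_built = datetime(2024, 1, 2, 9, 0)  # three hours before `now`

daily = timedelta(days=1)
hourly = timedelta(hours=1)

stale_for_daily = is_stale(last_built, daily, now)    # still within a daily cadence
stale_for_hourly = is_stale(last_built, hourly, now)  # overdue for an hourly cadence
```

The same build time is acceptable for one consumer and late for another, which is why the cadence has to come from organizational requirements rather than from the pipeline itself.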

The following resources help you get started with scheduling builds in Foundry:

Data quality

In the initial stages of defining a pipeline, we recommend frequently checking the quality of inputs and outputs at every step. Data synced from source systems often includes undefined values and poorly formatted or inconsistent data. Cleaning and normalizing data is a core part of the pipeline building process.
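A minimal cleaning step might look like the following sketch. The placeholder strings treated as undefined and the accepted date layouts are assumptions made for illustration; real source data will need its own rules.

```python
from datetime import date, datetime
from typing import Optional

# Assumed placeholder strings a source system might use for "no value".
UNDEFINED_TOKENS = {"", "null", "n/a", "none", "undefined"}
# Assumed date layouts present in the source data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_text(value: Optional[str]) -> Optional[str]:
    """Trim whitespace and map placeholder strings to None."""
    if value is None:
        return None
    trimmed = value.strip()
    return None if trimmed.lower() in UNDEFINED_TOKENS else trimmed


def parse_date(value: Optional[str]) -> Optional[date]:
    """Try each known layout in order; return None when nothing matches."""
    text = normalize_text(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` for unparseable values, rather than raising, keeps bad rows visible downstream so they can be counted instead of silently dropped.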

Tools for checking assumptions about datasets are available throughout Foundry:

  • Dataset Previews support computing statistics on any column of a dataset, as well as filtering to a subset of rows to quickly check expectations.
  • Code Repositories' support for debugging transforms can be used to check that input datasets are structured as expected while authoring transformation logic.
  • Applications in Foundry's analytical suite, especially Contour, can be very helpful for validating assumptions about datasets in a point-and-click fashion.
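Assumptions like these can also be written down as explicit checks. The sketch below is plain Python over a list of rows, with hypothetical column names and an arbitrary threshold; it is not Foundry's health check API, only an illustration of the kind of condition worth asserting about a dataset.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def null_fraction(rows: List[Row], column: str) -> float:
    """Share of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """True when no non-null value of `column` repeats."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))


def check_dataset(rows: List[Row]) -> Dict[str, bool]:
    """Evaluate a few assumed expectations and report pass/fail per check."""
    return {
        "has_rows": len(rows) > 0,
        "id_is_unique": is_unique(rows, "id"),
        "email_mostly_present": null_fraction(rows, "email") <= 0.5,
    }


rows: List[Row] = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": None},
    {"id": "2", "email": "c@example.com"},
]
report = check_dataset(rows)
```

Here the duplicate `id` fails the uniqueness check while the other two expectations pass, so the report points at one specific broken assumption.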

After your pipeline has been established, health checks are the recommended way to validate that data remains high-quality over time. Here are some resources to get started with health checks:

Security and governance

Foundry's platform security primitives provide best-in-class capabilities for securing a data foundation and ensuring that sensitive data is handled appropriately. The cross-cutting concepts of Projects and Markings provide support for discretionary and mandatory controls, respectively, which can be used to comply with the full range of governance requirements.
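The distinction between discretionary and mandatory controls can be sketched as follows. The model is a deliberate simplification with hypothetical names: it assumes that viewing a dataset requires both a role granted on the containing Project and membership in every Marking applied to the data. It illustrates the two kinds of control and should not be read as Foundry's actual permission evaluation.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class User:
    name: str
    # Projects on which someone has granted this user a role.
    project_roles: Set[str] = field(default_factory=set)
    # Markings this user is a member of.
    markings: Set[str] = field(default_factory=set)


@dataclass
class Dataset:
    project: str
    markings: Set[str] = field(default_factory=set)


def can_view(user: User, dataset: Dataset) -> bool:
    # Discretionary: access was granted at a project owner's discretion.
    has_role = dataset.project in user.project_roles
    # Mandatory: every marking on the data must be held, regardless of role.
    holds_markings = dataset.markings <= user.markings
    return has_role and holds_markings


patients = Dataset(project="clinical", markings={"PII"})
analyst = User("analyst", project_roles={"clinical"})
steward = User("steward", project_roles={"clinical"}, markings={"PII"})
```

In this model the analyst has a project role but is still denied access to the marked dataset, while the steward, who holds both, can view it.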

To learn more about how to handle data securely in your pipeline, refer to these sections:

Support processes and documentation

Once a pipeline has been published to production by following the above guidelines, it is important to think through the longevity of the pipeline from an organizational perspective. Support processes for pipeline maintenance should be fleshed out, expectations should be clearly defined, and documentation should be available so that pipelines remain high-quality even as they are handed off from one team to another.

Learn more about these best practices:

