
Considerations: Pipeline Builder and Code Repositories

Foundry has two products available for writing and managing data pipelines: Pipeline Builder and Code Repositories. These tools are complementary and are built to work together to provide solutions for all pipelining needs. The guide below is intended to help you determine which tool is best suited to your use case and how to use them in conjunction with each other.

Pipeline Builder

Pipeline Builder is Foundry’s primary application for fast, flexible, and scalable delivery of data pipelines while providing robustness and security. With Pipeline Builder, end users and data engineers can collaborate in a graph and form-based environment to integrate data, create business logic transformations, and define a rigorous release process for production pipelines. Users can write pipelines that provide real-time feedback, with no need to use code. Additionally, Pipeline Builder uses health checks that guarantee only fully compliant data will be deployed to production. Learn more about Pipeline Builder.

Code Repositories

Code Repositories provides a web-based integrated development environment (IDE) for writing and collaborating on production-ready code in Foundry. The application provides a user-friendly way to interact with the underlying Git repository. Learn more about Code Repositories.

Best practices

We recommend building your pipeline design in Pipeline Builder. Doing so will:

  • Enable collaboration between different user groups through an easily understandable point-and-click interface.
  • Safeguard pipeline health by utilizing Pipeline Builder’s rails for safe and usage-efficient data transformations and pipeline management.

In cases where users require specialized code-based logic not available in Pipeline Builder, Code Repositories should be used to create those stages to add to the main pipeline. Some examples of these specialized cases include:

  • Making API calls
  • Using custom libraries
  • Adding code-based logical concepts
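
As a minimal sketch of the kind of code-based logic that might justify a Code Repositories stage, the Python below normalizes phone numbers with a reusable helper and a code-defined constant. The names `DEFAULT_COUNTRY_CODE`, `normalize_phone`, and `clean_rows` are hypothetical and do not come from Foundry; rows are modeled as plain dictionaries, and the Foundry-specific wiring that declares input and output datasets is deliberately omitted.

```python
import re
from typing import Dict, List

# Hypothetical repository-level constant (an assumption, not a Foundry setting).
DEFAULT_COUNTRY_CODE = "+1"


def normalize_phone(raw: str, country_code: str = DEFAULT_COUNTRY_CODE) -> str:
    """Strip formatting characters and prefix a country code when one is missing."""
    # Keep only digits and plus signs.
    digits = re.sub(r"[^\d+]", "", raw)
    if digits.startswith("+"):
        return digits
    return f"{country_code}{digits}"


def clean_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the rows with the 'phone' field normalized."""
    return [{**row, "phone": normalize_phone(row["phone"])} for row in rows]


if __name__ == "__main__":
    sample = [{"name": "Ada", "phone": "(555) 010-2000"}]
    print(clean_rows(sample))
```

Keeping the logic in small pure functions like these makes it straightforward to reuse across files in a repository and to test independently of any pipeline.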

Since both Pipeline Builder and Code Repositories use Foundry datasets as inputs and outputs, a pipeline stage built in Code Repositories can be added before, after, or in the middle of a pipeline in Pipeline Builder. Schedules and health checks can be configured for the full pipeline in Data Lineage, regardless of the application used to create the pipeline. Learn more about Data Lineage.
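
The idea that datasets are the shared interface between the two tools can be sketched in plain Python. This is a conceptual model only, not Foundry code: `Dataset`, `Stage`, `run_pipeline`, and the three stage functions are hypothetical names, and an in-memory list of dictionaries stands in for a Foundry dataset. The comments marking which tool might own each stage are illustrative assumptions.

```python
from typing import Any, Callable, Dict, List

Dataset = List[Dict[str, Any]]        # stand-in for a Foundry dataset
Stage = Callable[[Dataset], Dataset]  # a stage reads one dataset and writes one


def run_pipeline(source: Dataset, stages: List[Stage]) -> Dataset:
    """Feed each stage's output dataset into the next stage."""
    current = source
    for stage in stages:
        current = stage(current)
    return current


def drop_missing_amounts(rows: Dataset) -> Dataset:
    # Could be a no-code step, e.g. built in Pipeline Builder.
    return [r for r in rows if r.get("amount") is not None]


def convert_to_cents(rows: Dataset) -> Dataset:
    # Could be a code-based step, e.g. built in Code Repositories.
    return [{**r, "amount_cents": round(float(r["amount"]) * 100)} for r in rows]


def keep_large_orders(rows: Dataset) -> Dataset:
    # Could be another no-code step downstream of the code-based one.
    return [r for r in rows if r["amount_cents"] >= 1000]


if __name__ == "__main__":
    raw = [
        {"id": 1, "amount": "12.50"},
        {"id": 2, "amount": None},
        {"id": 3, "amount": "3.00"},
    ]
    print(run_pipeline(raw, [drop_missing_amounts, convert_to_cents, keep_large_orders]))
```

Because every stage has the same dataset-in, dataset-out shape, the code-based stage can sit at the start, the end, or the middle of the list without the other stages needing to know how it was built.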

Feature summary

The following table describes the features and support available in Pipeline Builder and Code Repositories. As explained above, using both tools together allows you to create robust, type-safe, reusable pipelines with specialized, code-based logic.

| Feature | Pipeline Builder | Code Repositories |
| --- | --- | --- |
| Recommended use | Build and maintain production pipelines for organizations and specialized pipelines for cross-organization collaboration. | Create specialized, code-based data transformations to add to a pipeline. |
| Build interface | | |
| Pipeline interface | Graph and form-based | Web-based integrated development environment (IDE) |
| Supported languages | No code required | Python, SQL, Java, Mesa |
| Reusability | Copy and paste complete pipelines or pipeline stages. | Reuse utility functions and libraries, and copy code between files. |
| Type-safe functions | Strongly typed; errors are flagged immediately instead of at build time. | Code-based; errors are surfaced at build time. |
| Parameters | User-defined persistent parameters that can be used across a pipeline. | Code-defined constants that can be used in a repository. |
| Supported pipelines | | |
| Batch pipeline | Yes | Yes |
| Streaming pipeline | Yes | Yes (for advanced users) |
| File-based transformation | Yes | Yes |
| Incremental computation | Yes | Yes |
| Filesystem and API access | No | Yes |
| Pipeline testing | | |
| Data preview scope | Preview based on full dataset. | Preview data sample. |
| Data preview timeline | Preview updates in real time. | Preview upon request. |
| Data preview checkpoints | Preview each transformation step. | Preview intermediary dataframes and variables at selected checkpoints in debug mode. |
| Debug | Type-safe; errors surface while creating the pipeline and do not require checks or builds to debug. | Debugger and Read-Eval-Print Loop (REPL) support. |
| Unit testing | Yes | Yes (for advanced users) |
| Pipeline management | | |
| Data expectations | Yes | Yes |
| Schedules | Yes | Yes |
| Publish custom libraries | In development | Yes |
| Versioning | Full versioning workflow on rails for no-code/high-code user collaboration. | Full Git workflow. |
| Build memory management | Users can set an approved or custom compute profile. | Code-based configuration is available. |
| Manage security markings | Yes | Yes |
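
To illustrate what unit testing a code-based transformation can look like, here is a small sketch using Python's standard `unittest` module. The transformation `add_full_name` and its column names are hypothetical, rows are plain dictionaries rather than Foundry datasets, and how tests are discovered and run inside a repository is not covered here.

```python
import unittest
from typing import Dict, List


def add_full_name(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Hypothetical transformation: derive 'full_name' from 'first' and 'last'."""
    return [{**r, "full_name": f"{r['first']} {r['last']}".strip()} for r in rows]


class AddFullNameTest(unittest.TestCase):
    def test_joins_first_and_last(self):
        out = add_full_name([{"first": "Grace", "last": "Hopper"}])
        self.assertEqual(out[0]["full_name"], "Grace Hopper")

    def test_handles_empty_last_name(self):
        # The trailing space left by an empty last name is stripped.
        out = add_full_name([{"first": "Plato", "last": ""}])
        self.assertEqual(out[0]["full_name"], "Plato")
```

Saved as a module, the tests can be run locally with `python -m unittest`. Testing the transformation logic as a pure function keeps the tests fast and independent of any dataset build.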
