Incremental pipelines
Incremental pipelines are often used to process input datasets that change significantly over time. By avoiding unnecessary computation on rows or files that have not changed, incremental pipelines lower end-to-end latency while minimizing compute costs.
However, incremental pipelines carry additional development and maintenance complexity that you should be aware of before getting started.
Background
Here are some of the factors related to incremental pipelines you may want to consider:
- Developing an incremental pipeline requires a thorough understanding of how datasets change over time in Foundry through transactions. You will need to work with dataset transactions in both Data Connection syncs and transformation logic to create and manage an incremental pipeline effectively.
- Once you understand how transactions work in Foundry, you will need to design your pipeline to be resilient to unexpected transactions in your input datasets. Although incremental pipelines generally process only changed data that arrives in the form of APPEND transactions, your logic must also handle input datasets occasionally being recomputed, which results in a SNAPSHOT transaction. Ideally, write thorough unit tests for your transformation logic to validate this behavior before it happens in practice.
- To ensure incremental pipelines remain performant in the long run, you will need to understand how datasets change when many APPEND transactions are applied, which causes datasets to consist of a large number of small files. This includes understanding how Spark handles large numbers of files and how this affects Spark partitioning. Read more about maintaining high performance for incremental pipelines.
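The resilience requirement described above can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: `Transaction` and `IncrementalSum` are hypothetical names invented for illustration and are not part of Foundry's API. The sketch assumes a SNAPSHOT replaces everything before it, so earlier incremental state must be discarded and rebuilt from that SNAPSHOT onward.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical, simplified stand-in for a dataset transaction (not Foundry's API).
@dataclass(frozen=True)
class Transaction:
    kind: str               # "APPEND" or "SNAPSHOT"
    rows: Tuple[int, ...]   # rows carried by this transaction


@dataclass
class IncrementalSum:
    """Keeps a running total, reading only unseen transactions when possible."""
    total: int = 0
    processed: int = 0   # number of input transactions already consumed
    rows_read: int = 0   # rows read across all runs, to show the work saved

    def run(self, history: List[Transaction]) -> int:
        new = history[self.processed:]
        if any(t.kind == "SNAPSHOT" for t in new):
            # The input was replaced: prior state is invalid, so recompute
            # from the most recent SNAPSHOT onward.
            last_snapshot = max(
                i for i, t in enumerate(history) if t.kind == "SNAPSHOT"
            )
            to_read = history[last_snapshot:]
            self.total = 0
        else:
            # Only APPENDs arrived: process just the new transactions.
            to_read = new
        for t in to_read:
            self.total += sum(t.rows)
            self.rows_read += len(t.rows)
        self.processed = len(history)
        return self.total


history = [Transaction("SNAPSHOT", (1, 2, 3))]
job = IncrementalSum()
first = job.run(history)                      # initial full read

history.append(Transaction("APPEND", (4,)))
second = job.run(history)                     # reads only the appended row

history.append(Transaction("SNAPSHOT", (10, 20)))
third = job.run(history)                      # input recomputed: state is reset
```

A unit test for real transformation logic could follow the same pattern: run once, append data, run again, then simulate a recomputed input and check that the output reflects only the data in the new SNAPSHOT.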
Getting started
Get started with incremental pipelines by reviewing the following recommended resources:
- Learn how to create incremental syncs to bring data into Foundry incrementally.
- See an example of how to create an incremental pipeline with Pipeline Builder.
- Refer to the Python incremental overview to learn about developing incremental transform logic.