跳转至

Types of pipelines(管道类型)

There are three main types of pipelines you can create in Foundry, and each provides different tradeoffs according to a few criteria:

  • Latency. How long does the pipeline take to run end-to-end?
  • Complexity. How difficult is it to author the pipeline and maintain it over time?
  • Compute cost. How much does the pipeline actually cost in terms of compute hours?
  • Resilience to change in data scale. How much additional data will flow into your pipeline over time?

The three types of pipelines are:

  • Batch pipelines fully recompute every dataset that has changed on each run.
  • Incremental pipelines only process new data that has changed since the last run.
  • Streaming pipelines run continuously and process new data as it arrives in the platform.

Below, we discuss each type of pipeline, its tradeoffs, and how to get started authoring this type of pipeline. For convenience, here is a summary table of the types of pipelines according to the tradeoffs mentioned above.

Criterion Batch Incremental Streaming
Latency High Low Very low
Complexity Low Medium High
Compute cost Medium Low High
Resilience to change in data scale Low High High

Foundry also offers faster versions of batch and incremental pipelines, which can improve performance at the cost of not supporting all expressions and transforms available in standard pipelines.

Batch

In a batch pipeline, all datasets in the pipeline are fully recomputed whenever upstream data changes. Because everything is recomputed, the end-to-end performance of the pipeline is very consistent over time, and the code and maintenance complexity of the pipeline is minimal. To enable more users to contribute to batch pipelines, a broad set of languages and tools are available for batch pipeline authoring, including SQL.

Examining batch pipelines according to the criteria above:

  • Latency of batch pipelines can be very high, as all data must be processed whenever it lands into the platform, even if it has not changed since the last sync. However, latency may be low if the overall data scale is small.
  • Complexity of batch pipelines is very low. Although it is usually still necessary to understand table manipulation concepts such as filters, joins, and aggregations, minimal further knowledge is necessary.
  • Compute costs for batch pipelines can be high, as a large amount of repeated computation occurs on every build of the pipeline. Again, this factor can be ignored if overall data scale is small.
  • Resilience to change in data scale is low. If large amounts of new data flow into the pipeline, such as when input datasets represents high-volume events, batch pipelines will become unmanageably expensive and take too long to run.

In most cases, you should begin pipeline development in Foundry by creating a batch pipeline and extending it to support incremental computation as the use case for the pipeline is validated. In many cases, you can keep using a batch pipeline indefinitely, especially if your data scale is low (e.g., less than tens of millions of rows).

If you expect that you will need to make your pipeline incremental in the future, we recommend using either Python or Java for batch pipeline development, as these languages support incremental computation.

Get started by learning how to create a batch pipeline in Pipeline Builder, or by following the tutorials for other languages:

Faster

:::callout{theme="neutral"} Faster pipelines were previously known as lightweight pipelines, as the term "lightweight" referred to the reduced time and compute resources required to execute these pipelines as opposed to the size of the data they handle. The name change reflects that faster pipelines reduce both execution time and compute resource usage, even for large-scale datasets. :::

Pipeline Builder now supports faster pipelines, which can provide faster execution in certain conditions. It uses a backend powered by DataFusion ↗, an open-source query engine written in Rust ↗. Compared to traditional Spark-based pipelines, faster pipelines can substantially accelerate compute processes for small to medium-sized datasets.

You can create a faster pipeline from scratch or convert an existing batch or incremental pipeline to a faster pipeline. Learn more about faster pipelines in Pipeline Builder

Incremental

In an incremental pipeline, only the rows or files of data that have changed since the last build are computed. This is suitable for processing event data and other datasets with large amounts of data changing over time. In addition to reducing the overall amount of computation, the end-to-end latency of the pipeline can be reduced significantly as compared to batch pipelines. Only Python and Java APIs are available for incremental computation.

Examining incremental pipelines according to the criteria above:

  • Latency of incremental pipelines can be made very low—on the order of minutes. In our experience, this is sufficiently low latency to meet the vast majority of organizational requirements.
  • Complexity of incremental pipelines is medium-to-high. Rather than just operating on the high-level concept of a tabular dataframe, writing and maintaining an incremental pipeline requires understanding how data flows into Foundry dataset transactions, how to handle cases when input datasets are updated non-incrementally, and how to maintain high performance over time.
  • Compute costs for incremental pipelines can be lower than for batch pipelines on high-scale datasets, as the amount of actual computation occurring can be minimized.
  • Resilience to change in data scale is high. Because only new data is processed, incremental pipelines avoid needing to redo computation on large datasets where most of the data is not changing.

To learn more about incremental pipelines, refer to these resources:

Streaming

In a streaming pipeline, your code runs continuously to process any new data that streams into Foundry, enabling the lowest levels of latency but incurring the highest amounts of complexity and compute costs. In general, it is helpful to think about streaming pipelines as closer to managing a microservice than managing a compute job—you need to be very thoughtful about uptime, resiliency, and stateful operations in order to run a streaming pipeline successfully.

Examining streaming pipelines according to the criteria above:

  • Latency of streaming pipelines can be made very low. If you have a strong requirement for data to be available for a use case in less than one minute, then streaming pipelines may be a good fit.
  • Complexity of streaming pipelines is very high. Writing these pipelines requires avoiding a broad set of patterns, such as stateful operations, which can unexpectedly cause instability in the future. In addition, streaming pipelines tend to have lower tolerance to failure than batch or incremental pipelines, as downtime can result in use cases that depend on consistently fresh data to consider any interruption as an outage.
  • Compute costs for streaming pipelines can be very high, as the nature of streaming pipelines requires compute resources to always be available to process new input data.
  • Resilience to change in data scale is high. Streaming pipelines are designed to support high throughput and can typically process higher changes in data scale than batch or even incremental pipelines.

In most cases, it is best to avoid creating a streaming pipeline unless your use case has very low latency requirements. Incremental pipelines can often be made performant down to minute-level end-to-end latencies to meet most needs without incurring the added complexity and compute costs of streaming pipelines.

To learn more about streaming pipelines, refer to the following resources:

Additional documentation for streaming pipelines will be available soon. If you are interested in building a streaming pipeline, contact your Palantir representative.


中文翻译

管道类型

在 Foundry 中,您可以创建三种主要类型的管道,每种类型在以下几个标准上提供了不同的权衡:

  • 延迟(Latency)。管道端到端运行需要多长时间?
  • 复杂性(Complexity)。编写管道并长期维护的难度如何?
  • 计算成本(Compute cost)。管道实际消耗的计算时长成本是多少?
  • 对数据规模变化的弹性(Resilience to change in data scale)。随着时间的推移,将有多少额外数据流入您的管道?

三种管道类型分别是:

  • 批处理(Batch)管道在每次运行时完全重新计算所有已更改的数据集。
  • 增量(Incremental)管道仅处理自上次运行以来发生变化的新数据。
  • 流式(Streaming)管道持续运行,并在新数据到达平台时进行处理。

下面,我们将讨论每种管道类型、其权衡以及如何开始编写此类管道。为方便起见,以下是根据上述权衡标准总结的管道类型表格。

标准 批处理 增量 流式
延迟 非常低
复杂性 中等
计算成本 中等
对数据规模变化的弹性

Foundry 还提供批处理和增量管道的更快版本,这些版本可以提高性能,但代价是不支持标准管道中所有可用的表达式和转换。

批处理

批处理管道中,每当上游数据发生变化时,管道中的所有数据集都会被完全重新计算。由于所有内容都会被重新计算,管道的端到端性能随时间推移非常稳定,并且代码和维护复杂性最低。为了让更多用户能够参与批处理管道的开发,批处理管道编写支持多种语言和工具,包括 SQL。

根据上述标准审视批处理管道:

  • 延迟:批处理管道的延迟可能非常高,因为每当数据进入平台时,即使自上次同步以来没有变化,也必须处理所有数据。然而,如果整体数据规模较小,延迟可能较低。
  • 复杂性:批处理管道的复杂性非常低。虽然通常仍需要理解表操作概念(如过滤、连接和聚合),但几乎不需要更多知识。
  • 计算成本:批处理管道的计算成本可能较高,因为每次构建管道时都会发生大量重复计算。同样,如果整体数据规模较小,可以忽略此因素。
  • 对数据规模变化的弹性较低。如果有大量新数据流入管道(例如输入数据集代表高容量事件),批处理管道将变得难以管理且成本高昂,运行时间也会过长。

在大多数情况下,您应该首先在 Foundry 中通过创建批处理管道开始管道开发,并在用例得到验证后扩展以支持增量计算。在许多情况下,您可以无限期地继续使用批处理管道,特别是当您的数据规模较小(例如,少于数千万行)时。

如果您预计将来需要将管道改为增量模式,我们建议使用 Python 或 Java 进行批处理管道开发,因为这些语言支持增量计算。

开始学习如何在 Pipeline Builder 中创建批处理管道,或按照其他语言的教程操作:

更快

:::callout{theme="neutral"} 更快管道以前被称为轻量级管道,因为"轻量级"指的是执行这些管道所需的时间和计算资源减少,而非它们处理的数据大小。名称变更反映了更快管道即使在处理大规模数据集时也能减少执行时间和计算资源使用。 :::

Pipeline Builder 现在支持更快管道,可以在特定条件下提供更快的执行速度。它使用由 DataFusion ↗ 驱动的后端,这是一个用 Rust ↗ 编写的开源查询引擎。与传统的基于 Spark 的管道相比,更快管道可以显著加速中小型数据集的计算过程。

您可以从头开始创建更快管道,或将现有的批处理或增量管道转换为更快管道。了解更多关于 Pipeline Builder 中的更快管道

增量

增量管道中,仅计算自上次构建以来发生变化的数据行或文件。这适用于处理事件数据和其他随时间变化大量数据的场景。除了减少整体计算量外,与批处理管道相比,管道的端到端延迟也可以显著降低。只有 Python 和 Java API 可用于增量计算。

根据上述标准审视增量管道:

  • 延迟:增量管道的延迟可以做到非常低——大约几分钟。根据我们的经验,这种低延迟足以满足绝大多数组织需求。
  • 复杂性:增量管道的复杂性为中等至高等。编写和维护增量管道不仅需要理解表格数据框的高级概念,还需要了解数据如何流入 Foundry 数据集事务、如何处理输入数据集非增量更新的情况,以及如何随时间保持高性能。
  • 计算成本:对于大规模数据集,增量管道的计算成本可能低于批处理管道,因为实际发生的计算量可以最小化。
  • 对数据规模变化的弹性较高。由于只处理新数据,增量管道避免了在大部分数据未变化的大型数据集上重新进行计算的需要。

要了解更多关于增量管道的信息,请参考以下资源:

流式

流式管道中,您的代码持续运行以处理任何流入 Foundry 的新数据,从而实现最低的延迟,但也会带来最高的复杂性和计算成本。通常,将流式管道视为更接近于管理微服务而非管理计算作业是有帮助的——您需要非常关注正常运行时间、弹性和有状态操作,才能成功运行流式管道。

根据上述标准审视流式管道:

  • 延迟:流式管道的延迟可以做到非常低。如果您有强烈需求要求数据在不到一分钟内可用于用例,那么流式管道可能是一个不错的选择。
  • 复杂性:流式管道的复杂性非常高。编写这些管道需要避免一系列模式,例如有状态操作,这些操作可能会在未来意外导致不稳定。此外,流式管道对故障的容忍度通常低于批处理或增量管道,因为停机可能导致依赖持续新鲜数据的用例将任何中断视为故障。
  • 计算成本:流式管道的计算成本可能非常高,因为流式管道的性质要求计算资源始终可用以处理新的输入数据。
  • 对数据规模变化的弹性较高。流式管道旨在支持高吞吐量,通常能够比批处理甚至增量管道处理更高的数据规模变化。

在大多数情况下,除非您的用例有非常低的延迟要求,否则最好避免创建流式管道。增量管道通常可以实现分钟级的端到端延迟,以满足大多数需求,而无需承担流式管道带来的额外复杂性和计算成本。

要了解更多关于流式管道的信息,请参考以下资源:

流式管道的其他文档即将推出。如果您有兴趣构建流式管道,请联系您的 Palantir 代表。