Computation modes for batch input datasets（批处理输入数据集的计算模式）¶

You can choose to read your input dataset as snapshot or incremental, depending on your use case.

Snapshot computation¶

Snapshot computation performs transforms over the entire input, not just newly-added data. The output dataset is fully replaced by the latest pipeline output every build.

Example of snapshot computation

Best used when:

The input dataset is not updating via APPEND transactions.
When the input is written using SNAPSHOT transactions, incrementally reading the input is not possible.
The output dataset cannot update via APPEND transactions.
Example: The entire output dataset is subject to change with each run, requiring snapshot outputs.
The input dataset is small.
Snapshot computation is similarly efficient to incremental computation in this case.

Incremental computation¶

Incremental computation performs transforms only on new data that has been appended to the selected input since the last build. This can reduce compute resources, but comes with important restrictions.

:::callout{theme="warning"} A pipeline will only run with incremental computation if the selected input dataset changes through APPEND or UPDATE transactions that do not modify existing files. Marking a snapshot input as incremental will have no effect. :::

Example of incremental computation

Best used when:

The input dataset changes via APPEND transactions or additive UPDATE transactions.
This indicates the previous output stays the same as new data is added. Incremental computation cuts down the amount of data processed with each build.
You do not need to reference the previous output.
Pipeline Builder currently disallows read mode: previous. However, some output write modes support common use cases of this read mode. Learn more about output write modes in Pipeline Builder.
The input dataset is large and new data is often added.
Incremental builds can save compute resources and time and lead to performance benefits.

Incremental computation restrictions¶

:::callout{theme="warning"} This section outlines restrictions that might be applicable to your workflow. Review prior to incremental computation setup to ensure proper implementation. :::

Joins: With joins involving an incremental dataset, the incremental dataset must be on the left side of the join and the snapshot dataset on the right side. Joins between two incremental datasets are also supported.
Snapshot inputs in joins: If a snapshot input receives a new transaction, any downstream joins that also involve an incremental dataset will continue to run incrementally. Pipeline Builder does not support using a change in the snapshot input on the right side of a join to force a replay of the pipeline.
Unions: All inputs to a union must use the same computation mode (either all snapshot or all incremental).
Transforms: Transforms that may change the previous output are limited to the current transaction. Window functions, aggregations, and pivots apply only on the current transaction of data, not the previous output.
Replays: If your pipeline logic has changed and you would like to apply the new logic to previously processed input transactions, you may choose to replay on deploy. Only replays over the entire input are supported.

For more information, see an example of incremental computation in Pipeline Builder.

中文翻译¶

批处理输入数据集的计算模式¶

您可以根据具体用例选择以快照(Snapshot)或增量(Incremental)方式读取输入数据集。

快照计算(Snapshot computation)¶

快照计算会对整个输入数据执行转换，而不仅仅是新增的数据。每次构建时，输出数据集都会被最新的流水线输出完全替换。

快照计算示例

最佳适用场景：

输入数据集不是通过APPEND事务进行更新。
当输入使用SNAPSHOT事务写入时，无法增量读取该输入。
输出数据集无法通过APPEND事务进行更新。
例如：每次运行时整个输出数据集都可能发生变化，因此需要快照输出。
输入数据集较小。
在这种情况下，快照计算与增量计算的效率相当。

增量计算(Incremental computation)¶

增量计算仅对自上次构建以来追加到所选输入中的新数据执行转换。这可以减少计算资源消耗，但存在重要的限制。

:::callout{theme="warning"} 仅当所选输入数据集通过不修改现有文件的APPEND或UPDATE事务发生变化时，流水线才会以增量计算模式运行。将快照输入标记为增量将不会产生任何效果。 :::

增量计算示例

最佳适用场景：

输入数据集通过APPEND事务或增量式UPDATE事务发生变化。
这表明随着新数据的添加，之前的输出保持不变。增量计算可减少每次构建处理的数据量。
您不需要引用之前的输出。
Pipeline Builder目前不允许使用读取模式：previous。不过，某些输出写入模式支持该读取模式的常见用例。了解Pipeline Builder中的输出写入模式。
输入数据集较大且经常添加新数据。
增量构建可以节省计算资源和时间，并带来性能优势。

增量计算限制¶

:::callout{theme="warning"} 本节概述了可能适用于您工作流的限制。在设置增量计算之前请仔细阅读，以确保正确实施。 :::

连接(Joins)： 对于涉及增量数据集的连接，增量数据集必须位于连接的左侧，快照数据集位于右侧。两个增量数据集之间的连接也受支持。
连接中的快照输入： 如果快照输入接收到新事务，任何同时涉及增量数据集的下游连接将继续以增量方式运行。Pipeline Builder不支持利用连接右侧快照输入的变化来强制重放流水线。
并集(Unions)： 并集的所有输入必须使用相同的计算模式（全部为快照或全部为增量）。
转换(Transforms)： 可能改变之前输出的转换仅限于当前事务。窗口函数、聚合和数据透视表仅适用于当前事务的数据，而不适用于之前的输出。
重放(Replays)： 如果流水线逻辑已更改，并且您希望将新逻辑应用于之前处理的输入事务，您可以选择在部署时进行重放。仅支持对整个输入进行重放。

更多信息，请参见Pipeline Builder中的增量计算示例。