Maintaining high performance¶
Incremental transforms can help minimize resource use by ensuring that only new rows are processed whenever a pipeline updates. In Foundry, incremental transforms can only add new rows to the dataset and cannot replace or remove rows. This may be the case for data ingests as well. Since each run typically processes a small amount of data, each of these additions is small, but all of them must be considered together when rendering the active view.
Typically this is fine, but when tens or hundreds of thousands of small updates have accumulated (which can happen after weeks or months of uninterrupted building, depending on frequency), read and write performance may begin to degrade. This may be caused by several related factors, as discussed below.
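As a rough illustration of how quickly small updates accumulate, the arithmetic can be sketched as below. The build interval and files-per-run values are assumptions chosen for illustration, not Foundry defaults or measured figures.

```python
from typing import Tuple


def accumulated_counts(
    minutes_between_builds: int, days: int, files_per_run: int
) -> Tuple[int, int]:
    """Return (transactions, files) accumulated by uninterrupted incremental builds."""
    runs_per_day = (24 * 60) // minutes_between_builds
    transactions = runs_per_day * days
    files = transactions * files_per_run
    return transactions, files


# Hypothetical pipeline: one build every 5 minutes, 4 small files per run.
transactions, files = accumulated_counts(
    minutes_between_builds=5, days=90, files_per_run=4
)
print(transactions, files)
```

Under these assumed values, 90 days of uninterrupted building yields 25,920 transactions and 103,680 files in the active view, which is already in the range where degradation may appear.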
Possible causes¶
Large file counts¶
Having too many files in a dataset causes slowness in backend requests that access the current view’s data. This can cause preview unavailability issues, failures in Contour, and severe slowness in Transforms, especially Python Transforms, due to special overhead associated with loading files in that language environment.
Over time, any operation involving the active dataset view will be dominated by inefficient data reads, so much so that transforms that initially ran on the order of minutes may end up taking several hours or days. In extreme cases, backend requests may even time out, causing builds to fail.
Additionally, as part of writing a Spark DataFrame to an output in an APPEND or UPDATE transaction, an incremental transform needs to construct an in-memory list of all of the existing files in the output's current view. For outputs with many files, this operation can take a long time and may require the use of a high driver memory Spark profile to avoid out-of-memory errors.
Large transaction counts¶
As with too many files, too many transactions in the view can cause slowness. However, this slowness occurs not only when files are requested, but whenever the view range is resolved, which can affect dataset functionality such as computing stats, loading history, and so on. In transforms, this slowness manifests in the pre-compute stage, when the environment is being set up and before any Spark details are available, which makes the issue difficult to debug.
Usually transactions accumulate in tandem with files, but in some cases empty transactions can cause issues even with a relatively low file count. Conversely, even a relatively low number of transactions may result in a large number of files if each transaction contains many files.
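The two failure modes can be told apart by counting transactions and files separately. The sketch below uses a simplified, hypothetical `Transaction` record (not a Foundry class) to show a view with many empty transactions next to a view with few transactions but many files.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transaction:
    """Simplified stand-in for a committed transaction (not a Foundry class)."""
    txn_id: int
    file_count: int


def summarize_view(view: List[Transaction]) -> Dict[str, int]:
    """Count transactions, files, and empty transactions in a view."""
    return {
        "transactions": len(view),
        "files": sum(t.file_count for t in view),
        "empty_transactions": sum(1 for t in view if t.file_count == 0),
    }


# Many empty transactions: high transaction count, low file count.
mostly_empty = [Transaction(i, 1 if i % 100 == 0 else 0) for i in range(10_000)]
# Few transactions with many files each: low transaction count, high file count.
few_but_large = [Transaction(i, 5_000) for i in range(20)]

print(summarize_view(mostly_empty))
print(summarize_view(few_but_large))
```

In this illustration the first view has 10,000 transactions but only 100 files, while the second has 20 transactions and 100,000 files; either can degrade performance for the reasons described above.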
Possible solutions¶
Regular snapshotting¶
The simplest and, in some situations, most effective solution to keep incremental pipelines performant is to run regular snapshots. Snapshots will re-process the full data and replace the output with a fresh view, which can be partitioned efficiently. Snapshotting is a compute-intensive operation, which is why incremental processing is often preferred over regular processing, but occasional snapshotting can help prevent an excess of files or transactions from building up in the same view. To determine snapshotting frequency, administrators should balance performance for reading and writing data.
Snapshot transactions have cascading effects, in which each incremental transform that uses an input that has been snapshotted will snapshot itself. Rather than directly forcing an incremental transform to run a snapshot, we just need to snapshot one of the transform's inputs and let the Transforms application determine that the incremental transform needs snapshotting as well. Thus, a useful way to set up an incremental pipeline to regularly snapshot is to set up a dummy input that will build on the intended snapshot schedule, and make the dummy input always run in snapshot mode.
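The cascading behavior can be modeled as a simple propagation over the pipeline graph. This is a deliberately simplified sketch: the dataset names are hypothetical, and the real Transforms application applies more conditions than "any input was snapshotted" when deciding whether a run can be incremental.

```python
from typing import Dict, List, Set


def snapshot_cascade(
    inputs_of: Dict[str, List[str]], snapshotted: Set[str]
) -> Set[str]:
    """Return every dataset that ends up running as a snapshot.

    inputs_of maps each derived dataset to the datasets it reads from.
    """
    result = set(snapshotted)
    changed = True
    while changed:  # repeat until no further dataset is affected
        changed = False
        for dataset, inputs in inputs_of.items():
            if dataset not in result and any(i in result for i in inputs):
                result.add(dataset)
                changed = True
    return result


# Hypothetical pipeline: "dummy" is the always-snapshot input.
pipeline = {
    "clean": ["raw", "dummy"],
    "enriched": ["clean", "reference"],
    "report": ["enriched"],
    "other": ["reference"],
}
print(sorted(snapshot_cascade(pipeline, {"dummy"})))
```

Here snapshotting `dummy` causes `clean`, `enriched`, and `report` to snapshot in turn, while `other`, which does not depend on `dummy`, is unaffected.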
Regular snapshotting has several drawbacks, as described below:
- Snapshotting can complicate the lineage, since dummy inputs will appear in searches and in the folder structure, which may be confusing.
- Snapshotting can be very expensive, even if done infrequently.
- While a snapshot is running, no updates can propagate through the pipeline, which can interfere with SLAs (service-level agreements).
- Snapshotting does not work for raw datasets, such as those backed by a Data Connection sync.
- Not every transform is written in a way that supports snapshotting (though we recommend that they always should).
Dataset projections¶
The recommended way of dealing with incremental buildup is to add a dataset projection to the datasets involved. Dataset projections offer an alternate mechanism for querying data in a dataset, and are stored and updated independently from the canonical dataset. Because of these traits, dataset projections can break out of the append-only model of incremental computation and reorganize their internal data representation automatically as the volume of data grows. This is called compaction. Compaction ensures that reads from the projection remain performant, no matter how many files or transactions are in the canonical dataset.
This is particularly useful for incremental pipelines because dataset projections do not need to be completely in sync with the canonical dataset for readers to benefit from improved performance. All Foundry products know how to combine data coming from the projection and from the canonical dataset to reconstruct the view the same as if the projection wasn’t there. For example, if a dataset has 100 incremental transactions, and has a projection that was built with the first 99, then 99 will be read from the projection and only one from the dataset. Because of this, it is usually sufficient to update dataset projections daily or weekly, making them very computationally cheap to maintain.
Note that because the dataset projection is a separate resource from the canonical dataset, it can be built at any time, even if the canonical dataset is itself building (and the other way around). Readers will just use whatever state is current according to the valid transactions. For example, if a dataset is building transaction 10 and the projection starts building at the same time, the projection build will read up to transaction 9. A reader that queries data while both builds are running (assuming the projection's last completed build covered transactions 1-8) will read transactions 1-8 from the projection and 9 from the dataset, effectively seeing the same data as if reading from the dataset directly.
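The way a reader combines the projection with the canonical dataset can be sketched as a split over transaction numbers. The function and parameter names below are illustrative assumptions, not part of any Foundry API, and transactions are represented as plain integers.

```python
from typing import List, Tuple


def split_read(
    view: List[int], projection_covers_up_to: int
) -> Tuple[List[int], List[int]]:
    """Split a view's transactions into (read from projection, read from dataset)."""
    from_projection = [t for t in view if t <= projection_covers_up_to]
    from_dataset = [t for t in view if t > projection_covers_up_to]
    return from_projection, from_dataset


# 100 committed transactions, projection built from the first 99.
proj, canonical = split_read(list(range(1, 101)), projection_covers_up_to=99)
print(len(proj), len(canonical))  # 99 1

# Transaction 10 is still building, so the committed view is 1-9;
# the last completed projection build covered 1-8.
proj, canonical = split_read(list(range(1, 10)), projection_covers_up_to=8)
print(proj, canonical)
```

In both cases the union of the two parts equals the committed view, which is why the reader sees the same data as it would without the projection.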
Drawbacks of dataset projections include:
- Dataset projections use more storage than just the dataset alone would.
- Dataset projections do not work out of the box with Foundry Retention.
- Dataset projections do not modify the canonical dataset in any way (so if the projection is removed, reads will become inefficient again).
Retention policies¶
Sometimes a pipeline's use case does not require keeping historical data in-platform forever, and it is fine to retain only the most recent transactions, which can be done automatically using Foundry Retention. In this case, incremental pipelines can be built without special consideration, as long as the transform logic does not include any cross-transactional dependencies (such as aggregations or differential computation). In Python Transforms, the allow_retention flag must be set on the incremental decorator; otherwise, the DELETE transactions created by retention will trigger a snapshot run.
Drawbacks of retention policy changes include:
- Loss of historical data.
- If transactions aren’t self-contained units from a data perspective, retention policies may lead to inconsistent state (e.g. an end event without a matching start).
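The inconsistency described in the last point can be reproduced with a small model. This sketch assumes a simplified "keep the last N transactions" policy and a hypothetical event schema of `(session_id, event_type)` rows; actual retention policies are configured in Foundry Retention and may select transactions differently.

```python
from typing import List, Set, Tuple

# Each transaction is a list of (session_id, event_type) rows.
Transaction = List[Tuple[str, str]]


def apply_retention(transactions: List[Transaction], keep_last: int) -> List[Transaction]:
    """Keep only the most recent `keep_last` transactions."""
    return transactions[-keep_last:] if keep_last > 0 else []


def unmatched_ends(transactions: List[Transaction]) -> Set[str]:
    """Return session ids that have an 'end' event but no 'start' event."""
    starts: Set[str] = set()
    ends: Set[str] = set()
    for txn in transactions:
        for session_id, event_type in txn:
            if event_type == "start":
                starts.add(session_id)
            elif event_type == "end":
                ends.add(session_id)
    return ends - starts


history = [
    [("a", "start")],
    [("a", "end"), ("b", "start")],
    [("b", "end")],
]
print(unmatched_ends(history))                      # full history is consistent
print(unmatched_ends(apply_retention(history, 2)))  # session "a" lost its start
```

With the full history every end event has a start; after retaining only the last two transactions, session `a` has an end event whose start was deleted.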
Additional options to consider¶
Aborting transactions¶
Under some circumstances, transactions will be committed with zero files, or only with empty files. These transactions have no impact on the view, but they are considered valid updates and will trigger schedules and all related side effects, which may result in wasted computation. Empty transactions can also greatly increase the file and transaction counts.
Empty transactions are best avoided at the source, which is generally a Data Connection sync. Data Connection will always abort transactions with no files, but empty files can still be generated. Empty transactions can be a particular issue with custom plugins; at times it may not be possible to modify the plugin to avoid empty transactions (for instance, if a non-empty transaction is required to update the Data Connection incremental metadata). In other cases, no-file transactions can be committed by transforms or other means.
To minimize the impact of these empty or no-file transactions, we can explicitly abort transactions at the most upstream transform in the pipeline. When we detect that we have received an empty input, or have ended up with an empty output, we can call .abort() on the output object; this will cause the job to be aborted, along with its transaction. Aborted transactions are effectively cancelled and will not trigger schedules or cause any side effects. Aborting the transaction breaks the chain and stops empty transactions from propagating down the pipeline. Aborting the transaction also does not contribute to failure statistics (whereas purposefully failing the build would).
Note that aborted builds are considered successful, and will advance the input transaction pointers. Therefore, aborting with non-empty inputs will discard data, which may not be desired if you want to stop the build for another reason, like a failed precondition.
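The control flow for aborting on empty input or empty output might look like the sketch below. To keep it runnable outside Foundry, it uses `FakeOutput`, a hypothetical stand-in whose `write_rows` method and the `valid` row field are assumptions for illustration; only the `.abort()` call mirrors the method named above.

```python
from typing import List


class FakeOutput:
    """Hypothetical stand-in for a transform output; NOT the Foundry class."""

    def __init__(self) -> None:
        self.written: List[dict] = []
        self.aborted = False

    def write_rows(self, rows: List[dict]) -> None:
        self.written.extend(rows)

    def abort(self) -> None:
        self.aborted = True


def compute(new_rows: List[dict], out: FakeOutput) -> None:
    """Write the filtered rows, or abort when there is nothing to write."""
    if not new_rows:
        out.abort()  # empty input: nothing to propagate downstream
        return
    result = [r for r in new_rows if r.get("valid")]
    if not result:
        out.abort()  # empty output: avoid committing an empty transaction
        return
    out.write_rows(result)


out = FakeOutput()
compute([], out)
print(out.aborted, out.written)
```

Note that in the empty-output branch the input rows are still consumed, which matches the caveat above: the abort is only appropriate because those rows were going to be dropped anyway.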
Changelog datasets¶
Changelog logic enables you to implement edit semantics on append-only transactions, making it possible to perform joins and aggregations reliably in incremental pipelines. However, in addition to the file and transaction count problems described above, implementing edit semantics on append-only transactions may allow the row count to grow without bound, making transform performance increasingly worse at the point where a state resolution stage is reached (or a standalone snapshot is required).
Keeping the row count under control for such pipelines is a little trickier. Snapshotting is possible, and may help a lot in intermediate transforms when partial states are present (since those mostly go away in a full rebuild). But to fully benefit from snapshots, the logic must collapse rows to their latest state, which is not always desirable (in some cases we may want to figure out the state of rows at any given point in time, not just the latest). Dataset projections can only help so much. And retention policies may not have the desired effect if rows aren’t atomic units (for example, we may end up with an end event without a matching start event). So special care must be taken when designing such pipelines.
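A state resolution step that collapses rows to their latest state can be sketched in plain Python as follows. The field names (`key`, `ordering`, `is_deleted`, `value`) are assumptions for illustration; a real changelog dataset may use different columns, and in a pipeline this logic would typically be expressed over a Spark DataFrame instead.

```python
from typing import Dict, List


def collapse_to_latest(changelog: List[dict]) -> List[dict]:
    """Resolve an append-only changelog to the latest state of each key.

    Each entry is assumed to carry: key, ordering (increasing per key),
    is_deleted, and value.
    """
    latest: Dict[str, dict] = {}
    for entry in changelog:
        current = latest.get(entry["key"])
        if current is None or entry["ordering"] > current["ordering"]:
            latest[entry["key"]] = entry
    # Keys whose latest entry is a deletion disappear from the resolved state.
    return sorted(
        (e for e in latest.values() if not e["is_deleted"]),
        key=lambda e: e["key"],
    )


changelog = [
    {"key": "a", "ordering": 1, "is_deleted": False, "value": 10},
    {"key": "b", "ordering": 1, "is_deleted": False, "value": 20},
    {"key": "a", "ordering": 2, "is_deleted": False, "value": 11},
    {"key": "b", "ordering": 2, "is_deleted": True, "value": 20},
    {"key": "c", "ordering": 1, "is_deleted": False, "value": 30},
]
resolved = collapse_to_latest(changelog)
print(len(changelog), len(resolved))
```

Five changelog rows collapse to two resolved rows here, which is the row-count reduction a snapshot can provide, at the cost of discarding the intermediate states that point-in-time queries would need.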