Comparison: Streaming vs batch¶
When deciding whether to process data in Foundry as a stream or as a batch dataset, start from the requirements of your use case. We recommend weighing the feature, performance, and technical expectations of both streaming and batch pipelines so that you choose the tool that fits best.
Generally, streaming is used for workflows that require low end-to-end latency. For use cases that can tolerate more than ten minutes of latency, incremental or standard batch datasets may also be suitable solutions.
Feature considerations¶
Streaming and batch pipelines share many common features. The primary differences to consider are latency, compute cost, and supported transform languages. The table below shows the features available for streaming and batch pipelines.
| Feature | Streaming | Batch |
|---|---|---|
| Higher latency data usage for all Foundry products | Yes | Yes |
| Transactional | Yes | Yes |
| Atomic | Yes | Yes |
| Branching | Yes | Yes |
| Security markings and classifications | Yes | Yes |
| Provenance tracking | Yes | Yes |
| Ontology support | Yes | Yes |
| Time series support | Yes | Yes |
| Java supported transforms | Yes | Yes |
| Pipeline Builder support | Yes | Yes |
| Python supported transforms | No | Yes |
| Low latency data access | Yes | No |
Front-end tools¶
Some Foundry front-end tools can consume a streaming dataset in real time, for example to update a graph live. The tools that can natively consume streams are the Ontology, Pipeline Builder, Quiver, Dataset Preview, and Foundry Rules. Other Foundry applications, such as Contour, consume the stream's archival dataset instead: a standard batch dataset that is updated with the stream's newest data every few minutes. In practice, you can select the streaming dataset in any application, and Foundry automatically chooses the appropriate mode.
Performance considerations¶
Along with feature considerations, consider the different performance expectations between streaming and batch pipelines. These factors include latency and throughput. Review streaming latency and throughput considerations for additional details.
Technical considerations¶
In addition to the feature and performance factors that define streaming and batch processing, consider technical factors including state management, downtime impact, and consumption layer latency.
State management¶
In contrast to batch transforms, where the size of inputs is bounded and known ahead of time, stateful streaming applications may have unbounded state that grows over time and causes an out-of-memory error at an unknown point in the future. For example, performing an aggregation over one or more keys is generally a dangerous operation if the size of the key space is unbounded. Unlike with batch compute, it is usually difficult to anticipate the resources a streaming application will need and to provision enough of them. For this reason, and due to the temporal nature of most streaming applications, ensure that state growth is bounded when designing streaming transforms.
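One common way to keep keyed state bounded is to expire keys that have been idle for longer than a time-to-live (TTL). The sketch below illustrates the idea in plain Python; it is not a Foundry streaming transform, and the class name `BoundedKeyedCounter`, the `ttl_seconds` parameter, and the sample keys are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class KeyState:
    count: int = 0
    last_seen: float = 0.0


class BoundedKeyedCounter:
    """Counts events per key, but forgets keys that have been idle too long."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._state = {}  # key -> KeyState

    def add(self, key: str, event_time: float) -> int:
        """Record one event for `key` and return its running count."""
        self._evict_idle(event_time)
        entry = self._state.setdefault(key, KeyState())
        entry.count += 1
        entry.last_seen = event_time
        return entry.count

    def _evict_idle(self, now: float) -> None:
        # Drop every key whose last event is older than the TTL.
        # A full scan keeps the sketch short; it is not an efficient strategy.
        idle = [k for k, s in self._state.items()
                if now - s.last_seen > self.ttl_seconds]
        for k in idle:
            del self._state[k]

    def tracked_keys(self) -> int:
        return len(self._state)


# Usage with hypothetical sensor keys and event times in seconds.
counter = BoundedKeyedCounter(ttl_seconds=60)
counter.add("sensor-a", event_time=0)
counter.add("sensor-a", event_time=10)
counter.add("sensor-b", event_time=100)  # "sensor-a" has been idle for 90 s and is evicted
```

With a TTL in place, memory use tracks the number of recently active keys rather than the number of keys ever seen. The trade-off is that a key that reappears after its TTL starts again from zero, so the TTL has to match what the aggregation is meant to express.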
Low latency¶
Low latency is usually an important requirement of a streaming pipeline, but achieving it is not always straightforward. Since streaming pipelines typically have multiple steps, end-to-end latency can be no lower than the latency of the slowest layer. Achieving low latency may therefore involve fine-tuning multiple applications and consumption layers, from the source system to the streaming application and downstream consumers.
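As a rough illustration of why every layer matters, the sketch below assumes that stages run one after another and that their latencies simply add up. The stage names and millisecond figures are made up for the example, not measurements of any Foundry component.

```python
def end_to_end_latency_ms(stage_latencies_ms: dict) -> float:
    """Total latency, assuming sequential stages with additive latency."""
    return sum(stage_latencies_ms.values())


def slowest_stage(stage_latencies_ms: dict) -> str:
    """Name of the stage that contributes the most latency."""
    return max(stage_latencies_ms, key=stage_latencies_ms.get)


# Hypothetical per-stage latencies in milliseconds.
stages = {
    "source_system": 50.0,
    "streaming_transform": 200.0,
    "consumption_layer": 1500.0,
}

total = end_to_end_latency_ms(stages)   # 1750.0
bottleneck = slowest_stage(stages)      # "consumption_layer"
```

In this made-up case, tuning the streaming transform alone could save at most 200 ms of the 1750 ms total. Measuring each layer first shows where tuning effort is likely to pay off.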
Downtime impact¶
Since batch pipelines are not typically used for operational workflows with strict low-latency requirements, they tend to tolerate downtime better than streaming pipelines. Batch pipelines usually run on a fixed schedule; an individual run failure is not typically a major issue, and the build can be retried. In contrast, streaming pipelines are continuously running processes with no defined endpoint, and failures are often more impactful. A 10-minute outage of a data store holding real-time data has a much higher impact than a 10-minute outage of a data store with daily updates.
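The difference can be made concrete with a back-of-the-envelope calculation. The sketch below assumes updates arrive at a fixed interval and estimates how many of them fall inside an outage window. The one-second interval for the real-time store is an assumed figure for illustration.

```python
def updates_delayed(outage_seconds: float, update_interval_seconds: float) -> float:
    """Expected number of updates that fall inside an outage window."""
    return outage_seconds / update_interval_seconds


outage = 10 * 60  # a 10-minute outage, in seconds

# Assumed: the real-time store receives one update per second.
real_time = updates_delayed(outage, update_interval_seconds=1)        # 600.0

# A store with daily updates receives one update every 86,400 seconds.
daily = updates_delayed(outage, update_interval_seconds=24 * 60 * 60)  # well under 1
```

Under these assumptions, the real-time store falls hundreds of updates behind, while the daily store will most likely not miss an update at all.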
Consumption layer latency¶
Because batch pipelines are usually bottlenecked by transforms rather than by reading or writing data, almost any consumption layer can be used. Streaming pipelines require consumption layers that natively support fast, frequent writes in order to maintain low latency. Each application in the end-to-end pipeline must therefore be able to process data with low latency. Most Foundry products support this natively, including Pipeline Builder, time series, Foundry Rules, and the Ontology. Data connectors also exist to support low-latency writes to external systems.
Cost considerations¶
Because streaming involves continuous high-throughput, low-latency computation, the average cost of a streaming pipeline is often higher than that of a batch pipeline. However, for a streaming-shaped pipeline (one that requires low-latency processing of data), streaming can be more cost-effective than a continuously running batch pipeline.
Review Compute usage with Foundry Streaming for more information on how streaming costs are computed.