Optimizing pipelines
When creating data pipelines in Foundry, you may need to understand how computation works behind the scenes in order to debug job failures or improve compute performance. In general, you should follow these steps when you encounter unexpected compute issues or performance problems.
If your pipeline is a batch pipeline, you may be able to speed up some compute jobs by making better use of the Spark engine that underlies computation in Foundry. However, this sort of performance tuning has limits. If your pipeline inputs are growing rapidly over time, you may need to make the pipeline incremental instead, so that each run processes only the rows or files of data that have actually changed.
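The difference between a batch run and an incremental run can be sketched in plain Python. This is only an illustration of the idea, not Foundry's transforms API or Spark code: the function `process_incrementally`, the file names, and the in-memory `seen` set are all hypothetical stand-ins for whatever mechanism a real pipeline uses to track which inputs it has already handled.

```python
from typing import Callable, Dict, List, Set


def process_incrementally(
    all_files: Dict[str, List[int]],
    already_processed: Set[str],
    transform: Callable[[List[int]], int],
) -> Dict[str, int]:
    """Apply `transform` only to files that have not been processed yet.

    Hypothetical helper: `already_processed` is mutated so that the next
    call skips everything handled by this one.
    """
    new_results: Dict[str, int] = {}
    for name in sorted(all_files):
        if name in already_processed:
            continue  # unchanged input: skip the recomputation
        new_results[name] = transform(all_files[name])
        already_processed.add(name)
    return new_results


# First run: nothing has been processed yet, so every file is read.
files = {"day1.csv": [1, 2, 3], "day2.csv": [4, 5]}
seen: Set[str] = set()
first_run = process_incrementally(files, seen, sum)

# Second run: one new file has arrived, and only that file is processed.
files["day3.csv"] = [6]
second_run = process_incrementally(files, seen, sum)
```

A batch version would recompute all three files on the second run. The incremental version does work proportional to the new data only, which is why it scales better when inputs keep growing.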
If you want to start by debugging a job or end-to-end pipeline that is failing unexpectedly, refer to these guides:
If you are interested in understanding how computation works in Foundry under the hood, begin by exploring the Spark core concepts.