Optimize multi-output transforms¶

There are two ways to derive a set of output datasets from a set of input datasets with transforms:

Make multiple transforms with the same inputs and different outputs.
Define a multi-output transform that takes multiple inputs and produces multiple outputs.

A hybrid approach is also possible: several multi-output transforms that share the same inputs, each producing a different set of outputs. The following sections explore these options.

Multiple single-output transforms¶

A single-output transform takes inputs X1, X2, ..., Xn and produces one output Y. To obtain multiple outputs Y1, Y2, ..., Ym, write multiple transforms that take these X inputs, each writing to a different output, so that each output has its own transform.
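As a minimal sketch in plain Python (not the platform's transforms API), this pattern can be modeled as one function per output, all reading the same inputs. The function names, the sample data, and the `build` helper are hypothetical and exist only for illustration.

```python
# Plain-Python model of "one transform per output".
# All names here are hypothetical; this is not the platform's transforms API.

def total_by_region(orders, customers):
    """Output Y1: order totals per customer region."""
    region_of = {c["id"]: c["region"] for c in customers}
    totals = {}
    for order in orders:
        region = region_of[order["customer_id"]]
        totals[region] = totals.get(region, 0) + order["amount"]
    return totals


def order_count_by_customer(orders, customers):
    """Output Y2: number of orders per customer name."""
    name_of = {c["id"]: c["name"] for c in customers}
    counts = {}
    for order in orders:
        name = name_of[order["customer_id"]]
        counts[name] = counts.get(name, 0) + 1
    return counts


# Each output is registered against its own transform function.
TRANSFORMS = {
    "Y1": total_by_region,
    "Y2": order_count_by_customer,
}


def build(output_name, orders, customers):
    """Build one output without touching the others."""
    return TRANSFORMS[output_name](orders, customers)


orders = [
    {"customer_id": 1, "amount": 40},
    {"customer_id": 2, "amount": 25},
    {"customer_id": 1, "amount": 10},
]
customers = [
    {"id": 1, "name": "Ada", "region": "EU"},
    {"id": 2, "name": "Bo", "region": "US"},
]

y1 = build("Y1", orders, customers)  # only Y1 is computed here
```

Because each output maps to its own function, building `Y1` never runs the logic for `Y2`.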

Benefits of multiple single-output transforms¶

Multiple single-output transforms have the following advantages:

The logic for each output is contained in its own transform, making it easier to maintain.
If each output has different inputs, you can assign each transform only the minimal set of inputs it requires. This can make builds quicker and allows you to separate inputs with markings to limit propagation, if necessary.
You can build each output independently.
You do not need to wait for all outputs to finish to rerun one output. This is useful if you want to build transforms at varying frequencies.
You can build a single output without building all outputs, which can be expensive.
You can customize Spark profiles for each output, which is useful if they vary in compute costs.
You can write to different projects. This is possible because you can move transforms to different repositories. Note that this can also be achieved with multi-output transforms by adding a step to copy data to a different location.
You can use different libraries or frameworks for your transforms. For instance, some outputs could use lightweight transforms in Pandas, and others could use PySpark.
Failures are contained to single transforms, whether they are coming from the data, the code, or the platform.
Parallelization happens both between transforms and within each transform.
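The failure-containment point in the list above can be illustrated with a small plain-Python model. It is a sketch only: the two build functions and the `build_all_independently` helper are hypothetical stand-ins for separate transforms, not a platform API.

```python
# Hypothetical model: each output is built by its own function, so a failure
# in one build does not prevent the others from succeeding.

def build_clean(rows):
    """Drop missing rows."""
    return [r for r in rows if r is not None]


def build_ratio(rows):
    """Fails on bad data: raises ZeroDivisionError when a row equals 0."""
    return [100 / r for r in rows if r is not None]


def build_all_independently(transforms, rows):
    """Run every single-output build separately and record its status."""
    results, failures = {}, {}
    for name, fn in transforms.items():
        try:
            results[name] = fn(rows)
        except Exception as exc:  # contain the failure to this output
            failures[name] = repr(exc)
    return results, failures


rows = [4, None, 0, 5]
results, failures = build_all_independently(
    {"clean": build_clean, "ratio": build_ratio}, rows
)
```

Here the `clean` output is still produced even though `ratio` fails on the zero row.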

Multiple single-output transforms limitations¶

Conversely, multiple single-output transforms may also bring the following disadvantages:

There might be duplicate logic.
This can make the code harder to maintain; you might want to extract duplicate logic into shared libraries.
This could mean redundant PySpark operations for each output. Depending on the context, you might want to save an intermediate dataset to avoid redundant Spark operations.
If output datasets are too dependent on each other, the duplicate logic may require many intermediate datasets. This would essentially translate into a multi-output transform.
Each transform has its own Spark environment, including a driver and, typically, executors.
This compute adds up, and if all outputs run at the same time, some queuing might happen.
The overhead costs to run transforms are applied across all transforms. This includes Spark initialization and costs to run the drivers.
Even though Spark profiles are customizable per transform, fine-tuning each one is a management cost in practice and is not always realistic. If profiles are not customized, outlier datasets of larger scale will have fewer executors to work with than they would in a multi-output transform, leading to longer builds and potential timeouts.

Overall, this option offers the most flexibility, but it is less suited to outputs that share operations and might be more computationally expensive.
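To get a rough feel for how per-transform overhead and duplicated logic add up, consider the following back-of-the-envelope model. The formulas and all numbers are illustrative assumptions, not platform measurements; the model ignores queuing, parallelism, and differences between Spark profiles.

```python
# Back-of-the-envelope cost model. The formulas and all numbers are
# illustrative assumptions, not measurements from any platform.

def single_output_cost(n_outputs, overhead, shared_work, unique_work):
    """n separate transforms: each pays the overhead and redoes shared work."""
    return n_outputs * (overhead + shared_work + unique_work)


def multi_output_cost(n_outputs, overhead, shared_work, unique_work):
    """One transform: overhead and shared work are paid once."""
    return overhead + shared_work + n_outputs * unique_work


n = 10
single = single_output_cost(n, overhead=2, shared_work=5, unique_work=1)
multi = multi_output_cost(n, overhead=2, shared_work=5, unique_work=1)
```

With these made-up units, ten separate transforms cost 80 and one combined transform costs 17. The gap shrinks as `shared_work` and `overhead` become small relative to `unique_work`.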

Multi-output transforms¶

A multi-output transform takes inputs X1, X2, ..., Xn and produces outputs Y1, Y2, ..., Ym. A single transform produces all outputs.

Benefits of multi-output transforms¶

Consider the advantages of using multi-output transforms below:

No repeated transform overhead costs.
A single driver and job.
This can become significant at high scale.
The logic for all outputs is in a single place.
Intermediate datasets are no longer required; intermediate results are computed in memory.
No duplicate costs for redundant logic.
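The last two points can be sketched in plain Python by counting how often a shared step runs under each approach. This is an illustration under stated assumptions: `clean` stands in for any expensive shared operation, and all function names are hypothetical.

```python
# Hypothetical sketch: count how often an expensive shared step runs.
calls = {"clean": 0}


def clean(rows):
    """Shared, 'expensive' step used by every output."""
    calls["clean"] += 1
    return [r.strip().lower() for r in rows if r.strip()]


# Single-output style: every transform repeats the shared step.
def unique_values(rows):
    return sorted(set(clean(rows)))


def value_lengths(rows):
    return [len(v) for v in clean(rows)]


# Multi-output style: the shared step runs once and stays in memory.
def unique_values_and_lengths(rows):
    cleaned = clean(rows)
    return {
        "unique_values": sorted(set(cleaned)),
        "value_lengths": [len(v) for v in cleaned],
    }


rows = [" Spark ", "pandas", "", "SPARK"]

unique_values(rows)
value_lengths(rows)
calls_single = calls["clean"]  # shared step ran once per output

calls["clean"] = 0
outputs = unique_values_and_lengths(rows)
calls_multi = calls["clean"]  # shared step ran once in total
```

In the single-output style the shared step runs once per output; in the multi-output style it runs once and both outputs reuse the in-memory result.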

Multi-output transforms limitations¶

Consider the disadvantages of using multi-output transforms below:

If the logic differs too much between outputs, the transform can become disorganized and harder to maintain.
Each output cannot be built independently.
You need to wait for the previous build to finish all outputs before triggering a new one. You cannot customize the frequency of outputs within the transform.
A single set of Spark profiles is assigned to the transform. Because a single driver processes all operations and assigns tasks to the executors, keep the following considerations in mind.
This can be mitigated with dynamic allocation profiles, but these profiles will fill resource queues quickly.
Tasks that are not parallelizable will generate overhead on the driver. This includes, but is not limited to, collecting to the driver, running user-defined functions (UDFs), and running Python code (for example, calling APIs).
When the data scale is small compared to the number of executors, the overhead from network inputs and outputs may become more significant than the computational work performed by the executors.
A build failure affects all outputs.
If the operations of one output take more than 24 hours, or contain an error, it will trigger a failure of all outputs.
This can make it harder to debug issues.
Cost usage is split equally among all output datasets; if there are 10 outputs, then a tenth of the build's cost is associated with each dataset. This makes it hard to identify the actual cost of each output.
If the size of the input data varies, the build duration variance will be the sum of the variances introduced by all inputs (assuming the inputs vary independently). This can become significant for incremental transforms, where dynamic allocation profiles would be better suited.
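The equal cost split described above can be made concrete with a short example. The figures are invented for illustration, and the "actual" per-output cost is a hypothetical quantity that the equal split does not expose.

```python
# Illustrative only: equal attribution of one build's cost across outputs.

def attribute_cost_equally(build_cost, output_names):
    """Assign each output the same share of the build's total cost."""
    share = build_cost / len(output_names)
    return {name: share for name in output_names}


outputs = [f"output_{i}" for i in range(1, 11)]  # 10 outputs
attributed = attribute_cost_equally(build_cost=120.0, output_names=outputs)

# Hypothetical "true" compute per output: one output dominates the build.
actual = {name: 5.0 for name in outputs}
actual["output_1"] = 75.0  # 75 + 9 * 5 = 120
```

Every output is attributed 12.0 units, although in this made-up scenario `output_1` accounts for 75 of the 120 units.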

Multi-output transforms are less flexible, but they are well suited for repeated logic in outputs.
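The point about build duration variance in the limitations above relies on a standard property: the variance of a sum of independent quantities equals the sum of their variances. The sketch below checks this on made-up numbers, assuming that each input adds to the build duration and that inputs vary independently.

```python
import itertools
import math
import statistics

# Hypothetical processing-time contributions (minutes) of two inputs.
input_a_minutes = [10, 20, 30]
input_b_minutes = [1, 5]

# With independent inputs, every combination is equally likely.
build_minutes = [
    a + b for a, b in itertools.product(input_a_minutes, input_b_minutes)
]

var_a = statistics.pvariance(input_a_minutes)
var_b = statistics.pvariance(input_b_minutes)
var_build = statistics.pvariance(build_minutes)

# Variance of the total equals the sum of the per-input variances.
variances_add_up = math.isclose(var_build, var_a + var_b)
```

If input sizes are correlated (for example, all inputs grow on the same days), the variance of the build duration can be larger than this sum.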

Additional considerations¶

Things to consider when you are using multi-output transforms:

You can check in the Spark details whether the task parallelization matches the number of executors. If builds take too long, or the Spark profile is too big, consider splitting the transform.
The more executors you use for each transform, the less significant the overhead costs. If you are transforming small-scale datasets into a lot of outputs, then network input/output could be more significant than executor work for a multi-output transform. Instead, you could use multiple transforms or lightweight transforms that are parallelized locally.
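One rough way to reason about whether task parallelization matches the executor count is a simple slot model, sketched below. It assumes that all tasks in a stage take equal time and that each executor core runs one task at a time; the task and executor numbers are hypothetical values that you would read from the Spark details of a build.

```python
import math

# Rough model; the numbers are hypothetical and would in practice be read
# from the Spark details of a build.

def executor_utilization(num_tasks, num_executors, cores_per_executor=1):
    """Return (waves, utilization) for one stage under a simple slot model."""
    slots = num_executors * cores_per_executor
    waves = math.ceil(num_tasks / slots)
    utilization = num_tasks / (waves * slots)
    return waves, utilization


well_matched = executor_utilization(
    num_tasks=64, num_executors=16, cores_per_executor=2
)
oversized = executor_utilization(
    num_tasks=8, num_executors=16, cores_per_executor=2
)
```

In the second case only a quarter of the available slots do any work, which suggests the profile is larger than the stage can use.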

Which approach to use¶

Single-output transforms are very flexible and well-suited to cases where the logic between outputs varies. Multi-output transforms are less flexible, but they can be more cost-effective under the right conditions. Generally, opt for multi-output transforms if you meet the following criteria:

The constraints of multi-output transforms satisfy your needs. Review multi-output transforms limitations to see if this option is right for your use case.
You have similar logic for the outputs.
Your operations are parallelizable.

If any of these conditions is not met, decide case by case, keeping in mind that multiple single-output transforms are the fallback option.
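The three criteria above can be encoded as a simple rule of thumb. The function below is a hypothetical helper for illustration, not part of any platform API, and it deliberately returns a non-committal answer when any criterion fails.

```python
def recommend_approach(limitations_acceptable, similar_logic, parallelizable):
    """Encode the three criteria above as a simple rule of thumb."""
    if limitations_acceptable and similar_logic and parallelizable:
        return "multi-output transform"
    return "decide case by case; fall back to multiple single-output transforms"


all_met = recommend_approach(
    limitations_acceptable=True, similar_logic=True, parallelizable=True
)
one_missing = recommend_approach(
    limitations_acceptable=True, similar_logic=False, parallelizable=True
)
```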
