Accelerate Spark with Velox

Spark acceleration, also called native acceleration, uses low-level, platform-specific hardware optimizations to improve the performance of Spark jobs. It aims to significantly reduce the time needed to process large-scale data workloads, which can result in faster job execution and better resource utilization.

Velox is a reusable, low-level data processing library that provides a set of primitives for building high-performance data processing systems. It is designed as a foundation for higher-level data processing systems, and Foundry uses it to accelerate Spark jobs.

Quick start

Spark acceleration can be used on any existing Spark pipeline. You do not need to make any changes to your logic.

To use native acceleration in your Python transform pipeline, complete the following steps:

1. Upgrade your Python repository to the latest version.
2. Configure an off-heap memory profile.
3. Enable the VELOX backend, as shown in the following code snippet:

from transforms.api import configure, ComputeBackend, Input, Output, transform_df


@configure(
    # Base executor memory profile plus a fractional off-heap profile for Velox.
    ["EXECUTOR_MEMORY_MEDIUM", "EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH"],
    # Run this transform on the Velox backend.
    backend=ComputeBackend.VELOX,
)
@transform_df(
    Output('/Project/folder/output'),
    source_df=Input('/Project/folder/input'),
)
def compute(source_df):
    ...

Configure memory for accelerated Spark

To tune a natively accelerated Spark project, start with the EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH profile for off-heap memory. Velox uses this memory because it handles some tasks outside the JVM. Observe how the job performs, then adjust the off-heap memory up or down as needed.
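
This page does not state what fraction each off-heap profile represents or exactly how the split is computed. As a rough mental model only, the sketch below assumes the fraction is carved out of the executor memory set by the EXECUTOR_MEMORY_X profile. The function name, the 16 GiB total, and the 0.5 fraction are all hypothetical values chosen for illustration, not Foundry's actual settings.

```python
from typing import Tuple


def split_executor_memory(total_gib: float, offheap_fraction: float) -> Tuple[float, float]:
    """Return (on_heap_gib, off_heap_gib) for a given total and off-heap fraction.

    Assumption: the off-heap share is taken out of the total executor memory.
    The real behavior of the Foundry profiles may differ.
    """
    if total_gib <= 0:
        raise ValueError("total_gib must be positive")
    if not 0.0 <= offheap_fraction < 1.0:
        raise ValueError("offheap_fraction must be in [0, 1)")
    off_heap = total_gib * offheap_fraction
    on_heap = total_gib - off_heap
    return on_heap, off_heap


# Hypothetical numbers: 16 GiB per executor, half of it off-heap.
on_heap_gib, off_heap_gib = split_executor_memory(16.0, 0.5)
print(f"JVM heap: {on_heap_gib:.1f} GiB, off-heap (Velox): {off_heap_gib:.1f} GiB")
```

Under this model, raising the fraction gives Velox more room at the expense of the JVM heap, which is why the adjustment is worth making in small steps while watching job performance.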

:::callout{neutral} To use a fractional off-heap profile, you must also set an EXECUTOR_MEMORY_X profile (for example, EXECUTOR_MEMORY_MEDIUM). Your job likely already has one. :::
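
The pairing rule in the callout can be checked before a job is submitted. The following is a minimal sketch in plain Python, not part of `transforms.api`: the helper names are hypothetical, and the prefix matching is an assumption based only on the two profile names shown on this page, so other profiles that share the `EXECUTOR_MEMORY_` prefix may need special handling.

```python
from typing import List

# Assumption: profile roles can be recognized by name prefix alone.
OFFHEAP_FRACTION_PREFIX = "EXECUTOR_MEMORY_OFFHEAP_FRACTION_"
EXECUTOR_MEMORY_PREFIX = "EXECUTOR_MEMORY_"


def is_offheap_fraction_profile(name: str) -> bool:
    return name.startswith(OFFHEAP_FRACTION_PREFIX)


def is_base_memory_profile(name: str) -> bool:
    # An EXECUTOR_MEMORY_X profile that is not a fractional off-heap profile.
    return name.startswith(EXECUTOR_MEMORY_PREFIX) and not is_offheap_fraction_profile(name)


def check_velox_profiles(profiles: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    has_offheap = any(is_offheap_fraction_profile(p) for p in profiles)
    has_base = any(is_base_memory_profile(p) for p in profiles)
    if not has_offheap:
        # Only fractional off-heap profiles are recognized by this sketch.
        problems.append("no fractional off-heap memory profile configured")
    elif not has_base:
        problems.append("fractional off-heap profile set without an EXECUTOR_MEMORY_X profile")
    return problems


print(check_velox_profiles(["EXECUTOR_MEMORY_MEDIUM", "EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH"]))
print(check_velox_profiles(["EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH"]))
```

The first call prints an empty list because both profiles are present; the second reports the missing EXECUTOR_MEMORY_X profile.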
