
Spark profiles reference

This page is a reference of the Spark profiles available in Foundry. Learn more about Spark and profiles here:

Driver cores

The profiles in this family configure the value of spark.driver.cores.

This controls how many CPU cores are assigned to the Spark driver. In practice, this should not need to be overridden, except in special cases where many Spark jobs run concurrently in the same Spark module.

Driver memory

The profiles in this family configure the value of spark.driver.memory.

This controls how much memory is assigned to the Spark driver JVM. This may need to be raised in some situations, for example when collecting large amounts of data back to the driver, or when performing large broadcast joins.

:::callout{theme="neutral"} This only controls the JVM memory, not the memory available to Python processes. If you are pulling lots of data locally to perform transformation using Pandas, you will need a different profile. :::
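As a rough illustration of sizing the driver for collected data, the sketch below picks the smallest Driver Memory profile whose spark.driver.maxResultSize covers an estimated result size. The limits are copied from the profile table on this page; `smallest_profile_for` is a hypothetical helper, and the size estimate is an assumption you would have to supply yourself. Data held in the driver after deserialization may need more memory than its serialized size, so treat the result as a lower bound.

```python
from typing import Optional

# spark.driver.maxResultSize values in GiB, copied from the Driver Memory rows
# of the profile table. DRIVER_MEMORY_SMALL lists no value, so it is left out.
MAX_RESULT_SIZE_GB = {
    "DRIVER_MEMORY_MEDIUM": 4,
    "DRIVER_MEMORY_LARGE": 8,
    "DRIVER_MEMORY_EXTRA_LARGE": 16,
    "DRIVER_MEMORY_EXTRA_EXTRA_LARGE": 32,
}


def smallest_profile_for(collected_gb: float) -> Optional[str]:
    """Return the smallest listed profile whose maxResultSize fits the result."""
    for name, limit_gb in sorted(MAX_RESULT_SIZE_GB.items(), key=lambda kv: kv[1]):
        if collected_gb <= limit_gb:
            return name
    return None  # Larger than every listed limit: avoid collecting to the driver.


print(smallest_profile_for(3))   # DRIVER_MEMORY_MEDIUM
print(smallest_profile_for(5))   # DRIVER_MEMORY_LARGE
print(smallest_profile_for(40))  # None
```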

Executor cores

The profiles in this family configure the value of spark.executor.cores.

This controls how many CPU cores are assigned to each Spark executor, which in turn controls how many tasks are run concurrently in each executor. In practice, this should rarely need to be overridden in normal transforms jobs.

Executor memory

The profiles in this family configure the value of spark.executor.memory and associated settings.

This controls how much memory is assigned to each Spark executor JVM. This may need to be raised if the amount of data being processed in each Spark task is very large.

:::callout{theme="neutral"} This memory is shared between all tasks running concurrently on the executor. The number of concurrent tasks is controlled by the Executor cores profiles. :::
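To make the note above concrete, the sketch below estimates the share of executor JVM memory available to each task. The even split is a simplifying assumption (Spark shares executor memory between running tasks dynamically), it assumes one core per task, and `memory_per_task_gb` is a hypothetical helper. The values come from the profile table on this page.

```python
def memory_per_task_gb(executor_memory_gb: float, executor_cores: int) -> float:
    """Estimate the JVM memory share of one task, assuming an even split."""
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    return executor_memory_gb / executor_cores


# EXECUTOR_MEMORY_MEDIUM (13g) with EXECUTOR_CORES_SMALL (2 cores)
print(memory_per_task_gb(13, 2))  # 6.5
# Same memory with EXECUTOR_CORES_MEDIUM (4 cores)
print(memory_per_task_gb(13, 4))  # 3.25
```

Raising the executor core count without raising executor memory therefore leaves less memory for each task.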

Executor memory overhead

The profiles in this family configure the value of spark.executor.memoryOverhead.

This controls how much memory is assigned to each container in addition to the Spark executor JVM memory. This may need to be raised if your job requires a significant amount of memory outside the JVM.
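The relationship between the two settings can be sketched as a simple sum: the container needs roughly the executor JVM memory plus the overhead. The helpers `to_mb` and `container_memory_mb` are hypothetical, and the sketch ignores any other memory settings (such as off-heap fractions) that may also contribute to the container size.

```python
import re

# Spark size suffixes used in the profile table: m = mebibytes, g = gibibytes.
_UNITS_MB = {"m": 1, "g": 1024}


def to_mb(size: str) -> int:
    """Convert a size such as '768m' or '13g' to mebibytes."""
    match = re.fullmatch(r"(\d+)([mg])", size.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size: {size!r}")
    return int(match.group(1)) * _UNITS_MB[match.group(2)]


def container_memory_mb(executor_memory: str, memory_overhead: str) -> int:
    """Approximate container size: JVM memory plus memory overhead."""
    return to_mb(executor_memory) + to_mb(memory_overhead)


# Values listed for EXECUTOR_MEMORY_MEDIUM: 13g JVM memory + 2g overhead
print(container_memory_mb("13g", "2g"))  # 15360
# The same JVM memory with a 4g overhead
print(container_memory_mb("13g", "4g"))  # 17408
```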

Number of executors

The profiles in this family configure the value of spark.executor.instances and associated settings.

This controls how many executors are requested to run the job. Increasing this value increases the number of tasks which can run in parallel, therefore increasing performance (provided the job is parallel enough) at the cost of using more resources.

In practice this should only need to be overridden for large jobs with a particular organizational need to run very quickly.
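The trade-off described above can be sketched with a little arithmetic. The number of task slots is the executor count multiplied by the cores per executor (assuming one core per task), and a job only benefits from extra slots while it has enough partitions to fill them. Both function names are hypothetical.

```python
def task_slots(num_executors: int, executor_cores: int) -> int:
    """Tasks that can run at once, assuming one core per task."""
    return num_executors * executor_cores


def effective_parallelism(num_partitions: int, slots: int) -> int:
    """A stage cannot run more tasks at once than it has partitions."""
    return min(num_partitions, slots)


# NUM_EXECUTORS_2 with EXECUTOR_CORES_SMALL (2 cores)
print(task_slots(2, 2))                              # 4
# NUM_EXECUTORS_16 with EXECUTOR_CORES_SMALL (2 cores)
print(task_slots(16, 2))                             # 32
# A stage with 200 partitions can use all 32 slots
print(effective_parallelism(200, task_slots(16, 2)))  # 32
# A stage with 20 partitions leaves 12 slots idle
print(effective_parallelism(20, task_slots(16, 2)))   # 20
```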

Dynamic allocation

The profiles in this family configure the value of spark.dynamicAllocation.enabled, spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors.

This controls how many executors are requested to run the job by specifying a range of executors rather than a fixed count. Spark scales the number of requested executors up to maxExecutors and relinquishes executors when they are no longer needed. This can help when the number of executors needed varies between runs and, in some cases, can speed up launch times. The module is not guaranteed to receive maxExecutors executors and, because the number of executors varies, performance might differ from one run to another.

In practice, this should only need to be overridden for large jobs, and only with a clear understanding of the advantages and disadvantages of dynamic allocation.
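A simplified model of the scaling behavior described above: the number of executors Spark asks for follows the outstanding work, clamped to the configured range. This is a sketch under stated assumptions, not Spark's scheduler. `target_executors` is a hypothetical helper, timeouts are ignored, and the cluster may grant fewer executors than requested.

```python
import math


def target_executors(
    outstanding_tasks: int,
    tasks_per_executor: int,
    min_executors: int,
    max_executors: int,
) -> int:
    """Executors needed for the outstanding tasks, clamped to [min, max]."""
    needed = math.ceil(outstanding_tasks / tasks_per_executor)
    return max(min_executors, min(needed, max_executors))


# DYNAMIC_ALLOCATION_ENABLED_2_4 (min 2, max 4), 2 task slots per executor
print(target_executors(1, 2, 2, 4))    # 2  (never below minExecutors)
print(target_executors(6, 2, 2, 4))    # 3
print(target_executors(100, 2, 2, 4))  # 4  (capped at maxExecutors)
```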

Adaptive query execution

The profiles in this family enable and disable adaptive query execution (AQE).

With AQE enabled, Spark will automatically set the number of partitions at runtime, potentially speeding up your builds. It avoids too few partitions with insufficient parallelism, and too many small partitions with excessive overhead.

AQE aims for a balanced output size of 64 MB per partition. For example, a total output size of 512 MB will produce around 8 partitions.

You can increase the target size using the file size profiles in this family. Partition sizes of 128 MB and larger are recommended if the data written is frequently read, for example in Contour analyses.

:::callout{theme="neutral"} You might want to disable AQE if the total output is small but very time-intensive to compute, for example because of expensive UDFs. In that case AQE can reduce parallelism and slow down your computation. :::
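The partition arithmetic above can be sketched as follows. The function is a hypothetical estimate, not AQE's actual algorithm: it divides the total output size by the target partition size and rounds up. The 128 MB and 256 MB targets are the values listed for the ADVISORY_PARTITION_SIZE_MEDIUM and ADVISORY_PARTITION_SIZE_LARGE profiles.

```python
import math


def estimated_partitions(total_output_mb: float, target_partition_mb: int = 64) -> int:
    """Rough partition count for a given output size and target partition size."""
    return max(1, math.ceil(total_output_mb / target_partition_mb))


print(estimated_partitions(512))       # 8  (default 64 MB target)
print(estimated_partitions(512, 128))  # 4  (ADVISORY_PARTITION_SIZE_MEDIUM)
print(estimated_partitions(512, 256))  # 2  (ADVISORY_PARTITION_SIZE_LARGE)
print(estimated_partitions(10))        # 1  (small output: little parallelism)
```

The last case shows why a small but expensive output can run slower with AQE enabled: a single partition means a single task, however many executors are available.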

Number of cores per task

The profiles in this family configure the value of spark.task.cpus.

This controls how many cores are allocated for each task. In practice, this should only rarely be overridden. If you want to control the parallelism of your job, look into the Executor cores profiles instead.
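As an illustration of how this setting interacts with executor cores, the sketch below computes how many tasks one executor can run at once: the executor's cores divided by the cores per task, rounded down. `concurrent_tasks_per_executor` is a hypothetical helper, and the default of one core per task is Spark's default for spark.task.cpus.

```python
def concurrent_tasks_per_executor(executor_cores: int, task_cpus: int = 1) -> int:
    """Number of tasks one executor can run at the same time."""
    if task_cpus < 1:
        raise ValueError("task_cpus must be at least 1")
    return executor_cores // task_cpus


# EXECUTOR_CORES_MEDIUM (4 cores) with the default of 1 core per task
print(concurrent_tasks_per_executor(4))     # 4
# EXECUTOR_CORES_MEDIUM with TASK_CPUS_2
print(concurrent_tasks_per_executor(4, 2))  # 2
# EXECUTOR_CORES_LARGE (6 cores) with TASK_CPUS_4: two cores stay unused
print(concurrent_tasks_per_executor(6, 4))  # 1
```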

Arrow

Use these profiles to enable or disable Arrow for conversion between Pandas and PySpark dataframes. To use Arrow, ensure that your Transform depends on the pyarrow package.

When calling spark.createDataFrame() with a Pandas dataframe or toPandas(), Spark has to serialize all rows to convert them from one format to the other. For large dataframes this is a slow process and can be the bottleneck for your Transform. When using a Pandas Transform, this serialization happens both when reading and when writing your data.

Arrow is a more efficient serialization format that significantly speeds up this conversion (as reported on the Arrow website ↗).

Kubernetes

The profiles in this family control low-level details of how your Spark job is executed.

When using libraries that are not agnostic to the CPU architecture of the underlying machines, you can use profiles to force the Spark job to run on a specific architecture. Note that some environments only have access to machines with the AMD64 architecture; jobs that use the ARM64 architecture override will not succeed in those environments.

The KUBERNETES_OPEN_PORTS_ALL profile enables network communication between Spark executors, which is required for distributed model training.

Column statistics

This profile enables the computation of per-column statistics including min, max, and mean values. This profile is set to false by default, meaning that only per-column null counts and dataset-level metrics such as row count will be computed.

Profile table

| Profile Family | Profile Name | Spark Settings |
| --- | --- | --- |
| Driver Cores | DRIVER_CORES_SMALL | spark.driver.cores: 1 |
| Driver Cores | DRIVER_CORES_MEDIUM | spark.driver.cores: 2 |
| Driver Cores | DRIVER_CORES_LARGE | spark.driver.cores: 4 |
| Driver Cores | DRIVER_CORES_EXTRA_LARGE | spark.driver.cores: 8 |
| Driver Cores | DRIVER_CORES_EXTRA_EXTRA_LARGE | spark.driver.cores: 16 |
| Driver Memory | DRIVER_MEMORY_SMALL | spark.driver.memory: 3g |
| Driver Memory | DRIVER_MEMORY_MEDIUM | spark.driver.memory: 6g; spark.driver.maxResultSize: 4g |
| Driver Memory | DRIVER_MEMORY_LARGE | spark.driver.memory: 13g; spark.driver.maxResultSize: 8g |
| Driver Memory | DRIVER_MEMORY_EXTRA_LARGE | spark.driver.memory: 27g; spark.driver.maxResultSize: 16g |
| Driver Memory | DRIVER_MEMORY_EXTRA_EXTRA_LARGE | spark.driver.memory: 54g; spark.driver.maxResultSize: 32g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_SMALL | spark.driver.memoryOverhead: 1g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_MEDIUM | spark.driver.memoryOverhead: 2g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_LARGE | spark.driver.memoryOverhead: 4g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_EXTRA_LARGE | spark.driver.memoryOverhead: 8g |
| Driver Memory Overhead | DRIVER_MEMORY_OVERHEAD_EXTRA_EXTRA_LARGE | spark.driver.memoryOverhead: 16g |
| Executor Cores | EXECUTOR_CORES_EXTRA_SMALL | spark.executor.cores: 1 |
| Executor Cores | EXECUTOR_CORES_SMALL | spark.executor.cores: 2 |
| Executor Cores | EXECUTOR_CORES_MEDIUM | spark.executor.cores: 4 |
| Executor Cores | EXECUTOR_CORES_LARGE | spark.executor.cores: 6 |
| Executor Cores | EXECUTOR_CORES_EXTRA_LARGE | spark.executor.cores: 8 |
| Executor Memory | EXECUTOR_MEMORY_EXTRA_SMALL | spark.executor.memory: 3g; spark.executor.memoryOverhead: 768m |
| Executor Memory | EXECUTOR_MEMORY_SMALL | spark.executor.memory: 6g; spark.executor.memoryOverhead: 1536m |
| Executor Memory | EXECUTOR_MEMORY_MEDIUM | spark.executor.memory: 13g; spark.executor.memoryOverhead: 2g |
| Executor Memory | EXECUTOR_MEMORY_LARGE | spark.executor.memory: 27g; spark.executor.memoryOverhead: 3g |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_MINIMUM | Share of memory to use for off-heap (an “Executor Memory” profile must be set): 30% |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_LOW | Share of memory to use for off-heap (an “Executor Memory” profile must be set): 50% |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_MODERATE | Share of memory to use for off-heap (an “Executor Memory” profile must be set): 70% |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_HIGH | Share of memory to use for off-heap (an “Executor Memory” profile must be set): 80% |
| Executor Memory Off-heap | EXECUTOR_MEMORY_OFFHEAP_FRACTION_MAXIMUM | Share of memory to use for off-heap (an “Executor Memory” profile must be set): 90% |
| Executor Memory Overhead | EXECUTOR_MEMORY_OVERHEAD_SMALL | spark.executor.memoryOverhead: 1g |
| Executor Memory Overhead | EXECUTOR_MEMORY_OVERHEAD_MEDIUM | spark.executor.memoryOverhead: 2g |
| Executor Memory Overhead | EXECUTOR_MEMORY_OVERHEAD_LARGE | spark.executor.memoryOverhead: 4g |
| Executor Memory Overhead | EXECUTOR_MEMORY_OVERHEAD_EXTRA_LARGE | spark.executor.memoryOverhead: 8g |
| Executor Count | KUBERNETES_NO_EXECUTORS | spark.kubernetes.local.submission: true; spark.sql.shuffle.partitions: 1 |
| Executor Count | NUM_EXECUTORS_1 | spark.executor.instances: 1; spark.dynamicAllocation.maxExecutors: 1 |
| Executor Count | NUM_EXECUTORS_2 | spark.executor.instances: 2; spark.dynamicAllocation.maxExecutors: 2 |
| Executor Count | NUM_EXECUTORS_4 | spark.executor.instances: 4; spark.dynamicAllocation.maxExecutors: 4 |
| Executor Count | NUM_EXECUTORS_8 | spark.executor.instances: 8; spark.dynamicAllocation.maxExecutors: 8 |
| Executor Count | NUM_EXECUTORS_16 | spark.executor.instances: 16; spark.dynamicAllocation.maxExecutors: 16 |
| Executor Count | NUM_EXECUTORS_32 | spark.executor.instances: 32; spark.dynamicAllocation.maxExecutors: 32 |
| Executor Count | NUM_EXECUTORS_64 | spark.executor.instances: 64; spark.dynamicAllocation.maxExecutors: 64 |
| Executor Count | NUM_EXECUTORS_128 | spark.executor.instances: 128; spark.dynamicAllocation.maxExecutors: 128 |
| Executor Count | NUM_EXECUTORS_256 | spark.executor.instances: 256; spark.dynamicAllocation.maxExecutors: 256 |
| Executor Count | NUM_EXECUTORS_512 | spark.executor.instances: 512; spark.dynamicAllocation.maxExecutors: 512 |
| Task CPU Count | TASK_CPUS_2 | spark.task.cpus: 2 |
| Task CPU Count | TASK_CPUS_4 | spark.task.cpus: 4 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_DISABLED | spark.dynamicAllocation.enabled: false |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED | spark.dynamicAllocation.enabled: true |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MIN_2 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 2 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MIN_4 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 4 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MIN_8 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 8 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MIN_16 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 16 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_8 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 8 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_16 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 16 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_32 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 32 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_64 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 64 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_MAX_128 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.maxExecutors: 128 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_1_2 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 1; spark.dynamicAllocation.maxExecutors: 2 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_2_4 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 2; spark.dynamicAllocation.maxExecutors: 4 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_4_8 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 4; spark.dynamicAllocation.maxExecutors: 8 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_8_16 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 8; spark.dynamicAllocation.maxExecutors: 16 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_16_32 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 16; spark.dynamicAllocation.maxExecutors: 32 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_32_64 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 32; spark.dynamicAllocation.maxExecutors: 64 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_ENABLED_64_128 | spark.dynamicAllocation.enabled: true; spark.dynamicAllocation.minExecutors: 64; spark.dynamicAllocation.maxExecutors: 128 |
| Dynamic Allocation | DYNAMIC_ALLOCATION_FAST_SCALE_DOWN | spark.dynamicAllocation.executorIdleTimeout: 10s |
| Dynamic Allocation | DYNAMIC_ALLOCATION_SLOW_SCALE_UP_2M | spark.dynamicAllocation.schedulerBacklogTimeout: 2m |
| Shuffle Partitions | SHUFFLE_PARTITIONS_SMALL | spark.sql.shuffle.partitions: 20 |
| Shuffle Partitions | SHUFFLE_PARTITIONS_MEDIUM | spark.sql.shuffle.partitions: 200 |
| Shuffle Partitions | SHUFFLE_PARTITIONS_LARGE | spark.sql.shuffle.partitions: 2000 |
| Shuffle Partitions | SHUFFLE_PARTITIONS_EXTRA_LARGE | spark.sql.shuffle.partitions: 20000 |
| Adaptive Query Execution | ADAPTIVE_ENABLED | spark.sql.adaptive.enabled: true |
| Adaptive Query Execution | ADAPTIVE_DISABLED | spark.sql.adaptive.enabled: false |
| Adaptive Query Execution | ADVISORY_PARTITION_SIZE_MEDIUM | spark.sql.adaptive.enabled: true; spark.sql.adaptive.shuffle.targetPostShuffleInputSize: 128MB |
| Adaptive Query Execution | ADVISORY_PARTITION_SIZE_LARGE | spark.sql.adaptive.enabled: true; spark.sql.adaptive.shuffle.targetPostShuffleInputSize: 256MB |
| RPC Message Size | RPC_MESSAGE_MAX_SIZE_512M | spark.rpc.message.maxSize: 512 |
| RPC Message Size | RPC_MESSAGE_MAX_SIZE_1G | spark.rpc.message.maxSize: 1024 |
| RPC Message Size | RPC_MESSAGE_MAX_SIZE_MAX | spark.rpc.message.maxSize: 2047 |
| Legacy | LEGACY_ALLOW_UNTYPED_SCALA_UDF | spark.sql.legacy.allowUntypedScalaUDF: true |
| Legacy | LEGACY_ALLOW_NEGATIVE_DECIMAL_SCALE | spark.sql.legacy.allowNegativeScaleOfDecimal: true |
| Legacy | LEGACY_ALLOW_HASH_ON_MAPTYPE | spark.sql.legacy.allowHashOnMapType: true |
| Legacy | LEGACY_NAME_NON_STRUCT_GROUPING_KEY_AS_VALUE | spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue: true |
| Legacy | LEGACY_ARRAY_EXISTS_NULL_HANDLING | spark.sql.legacy.followThreeValuedLogicInArrayExists: false |
| Legacy | LEGACY_ALLOW_AMBIGUOUS_SELF_JOIN | spark.sql.analyzer.failAmbiguousSelfJoin: false |
| Legacy | LEGACY_TIME_PARSER_POLICY | spark.sql.legacy.timeParserPolicy: LEGACY |
| Legacy | LEGACY_DATETIME_REBASE_MODE | spark.sql.legacy.avro.datetimeRebaseModeInRead: LEGACY; spark.sql.legacy.parquet.datetimeRebaseModeInRead: LEGACY; spark.sql.legacy.avro.datetimeRebaseModeInWrite: LEGACY; spark.sql.legacy.parquet.datetimeRebaseModeInWrite: LEGACY |
| Legacy | LEGACY_FROM_DAYTIME_STRING | spark.sql.legacy.fromDayTimeString.enabled: true |
| Legacy | LEGACY_DATETIME_STRING_COMPARISON | spark.sql.legacy.typeCoercion.datetimeToString.enabled: true |
| Dates & Times | TIME_PARSER_POLICY_CORRECTED | spark.sql.legacy.timeParserPolicy: CORRECTED |
| Dates & Times | SPARK_ALLOW_INT96_AS_TIMESTAMP | spark.sql.parquet.int96AsTimestamp: true |
| Miscellaneous | BUCKET_SORTED_SCAN_ENABLED | spark.sql.sources.bucketing.sortedScan.enabled: true |
| Miscellaneous | LAST_MAP_KEY_WINS | spark.sql.mapKeyDedupPolicy: LAST_WIN |
| Miscellaneous | CROSS_JOIN_ENABLED | spark.sql.crossJoin.enabled: true |
| Miscellaneous | SPECULATIVE_EXECUTION | spark.speculation: true |
| Miscellaneous | AUTO_BROADCAST_JOIN_DISABLED | spark.sql.autoBroadcastJoinThreshold: -1 |
| Miscellaneous | ALLOW_ADD_MONTHS | spark.foundry.sql.allowAddMonths: true |
| Miscellaneous | PYSPARK_ROW_FIELD_SORTING_ENABLED | spark.executorEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: true; spark.yarn.appMasterEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: true; spark.kubernetes.driverEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: true |
| Miscellaneous | PYSPARK_ROW_FIELD_SORTING_DISABLED | spark.executorEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: false; spark.yarn.appMasterEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: false; spark.kubernetes.driverEnv.PYSPARK_ROW_FIELD_SORTING_ENABLED: false |
| Miscellaneous | PYSPARK_ROW_SCHEMA_CORRUPTION_CHECK_DISABLED | spark.kubernetes.driverEnv.PYSPARK_CHECK_ROW_SCHEMA_CORRUPTION: false; spark.yarn.appMasterEnv.PYSPARK_CHECK_ROW_SCHEMA_CORRUPTION: false; spark.executorEnv.PYSPARK_CHECK_ROW_SCHEMA_CORRUPTION: false |
| Miscellaneous | SPARK_KYRO_REFERENCE_TRACKING_DISABLED | spark.kryo.referenceTracking: false |
| Miscellaneous | GEOSPARK | spark.foundry.build.stats.enabled: false |
| Miscellaneous | SPARK_REFERENCE_TRACKING_DISABLED | spark.cleaner.referenceTracking: false |
| Miscellaneous | ENABLE_COLUMN_STATS | spark.foundry.build.stats.enableColumnStats: false |
| Miscellaneous | MANAGED_PROFILE | Enables [automatic profile optimization](https://palantir.com/docs/foundry/optimizing-pipelines/managed-profiles/) based on historical job metrics |
| Arrow | ARROW_ENABLED | spark.sql.execution.arrow.enabled: true; spark.sql.execution.arrow.pyspark.enabled: true; spark.sql.execution.arrow.sparkr.enabled: true; spark.sql.execution.arrow.fallback.enabled: true; spark.sql.execution.arrow.pyspark.fallback.enabled: true |
| Arrow | ARROW_DISABLED | spark.sql.execution.arrow.enabled: false; spark.sql.execution.arrow.pyspark.enabled: false; spark.sql.execution.arrow.sparkr.enabled: false |
| Kubernetes | KUBERNETES_CPU_ARCHITECTURE_OVERRIDE_AMD64 | N/A |
| Kubernetes | KUBERNETES_CPU_ARCHITECTURE_OVERRIDE_ARM64 | N/A |
| Kubernetes | KUBERNETES_OPEN_PORTS_ALL | Enables network communication between Spark executors for [distributed model training](https://palantir.com/docs/foundry/integrate-models/spark-distributed-model-training/) |
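Most entries in the Spark Settings column above follow a `key: value; key: value` pattern. As a small sketch of how to read that notation, the hypothetical helper below turns one such entry into a dictionary. It only applies to rows written in that form, not to descriptive entries such as `N/A`, and it says nothing about how Foundry itself stores or applies profiles.

```python
from typing import Dict


def parse_spark_settings(cell: str) -> Dict[str, str]:
    """Parse a 'key: value; key: value' table entry into a dictionary."""
    settings: Dict[str, str] = {}
    for entry in cell.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        key, separator, value = entry.partition(":")
        if not separator:
            raise ValueError(f"not a 'key: value' entry: {entry!r}")
        settings[key.strip()] = value.strip()
    return settings


# Entry listed for DRIVER_MEMORY_MEDIUM
print(parse_spark_settings("spark.driver.memory: 6g; spark.driver.maxResultSize: 4g"))
# {'spark.driver.memory': '6g', 'spark.driver.maxResultSize': '4g'}

# Entry listed for DYNAMIC_ALLOCATION_ENABLED_2_4
print(parse_spark_settings(
    "spark.dynamicAllocation.enabled: true; "
    "spark.dynamicAllocation.minExecutors: 2; "
    "spark.dynamicAllocation.maxExecutors: 4"
)["spark.dynamicAllocation.maxExecutors"])  # 4
```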
