Troubleshoot out-of-memory (OOM) errors

Out-of-memory errors can show up in a job in a few ways:

Seeing “Job aborted due to stage failure”
Seeing “ExecutorLostFailure”
Seeing “Spark module died while job [jobID] was using it. (ExitReason: MODULE_UNREACHABLE)”
Seeing “Connection lost to driver”

These error messages indicate you've gone past the maximum permitted memory for this build. This is usually not a fault with the platform, but a problem with the build you've asked the platform to run. There are a few steps you can take to reduce the memory required to run a build.

To troubleshoot, perform the following steps:

If your transform is written in Python or Pandas:
Move your computation into PySpark as much as possible to benefit from the power of the entire compute cluster. Logic in raw Python and Pandas runs on the driver, on a single processor, which is probably slower than your laptop.
If your transform is using joins:
Look for 'null joins': joins on columns where many of the row values are null. This can significantly increase the memory consumption of a join. To fix this, filter out nulls from the problematic columns in your transform or in a previous transform.
Look for joins that greatly increase the number of rows in the output dataset and confirm that the increase is necessary. One tip is to run an Analysis that computes the number of rows per key in each dataset and the resulting number of rows after the join.
Look for recursive joins that can be checkpointed. Repeated joins onto a single table will cause the query plan to grow rapidly. Use checkpoints to cache intermediate results to avoid this.
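The row-count check suggested for joins above comes down to simple arithmetic: under standard SQL equality, each key in an inner join contributes (rows on the left) × (rows on the right) output rows, and rows with a null key match nothing. The sketch below works on plain Python lists of key values, for example a small sample of each input. The function name and its inputs are illustrative only and are not part of any Foundry or Spark API; on full datasets the per-key counts would typically come from a grouped count instead.

```python
from collections import Counter


def estimate_inner_join_rows(left_keys, right_keys):
    """Estimate inner-join output size from the join-key values of each side.

    Hypothetical helper for illustration: both arguments are iterables of
    key values, with None standing in for a null key.
    """
    left_keys = list(left_keys)
    right_keys = list(right_keys)

    # Null keys never match under standard equality, so exclude them here
    # and report them separately below.
    left_counts = Counter(k for k in left_keys if k is not None)
    right_counts = Counter(k for k in right_keys if k is not None)

    # Each key present on both sides contributes left_count * right_count rows.
    per_key = {
        key: left_counts[key] * right_counts[key]
        for key in left_counts.keys() & right_counts.keys()
    }
    worst = sorted(per_key.items(), key=lambda kv: kv[1], reverse=True)[:5]
    return {
        "output_rows": sum(per_key.values()),
        "worst_keys": worst,
        "left_null_keys": sum(1 for k in left_keys if k is None),
        "right_null_keys": sum(1 for k in right_keys if k is None),
    }


# Example with made-up keys: "a" appears 2 times on the left and 3 on the right.
report = estimate_inner_join_rows(
    ["a", "a", "b", None, None],
    ["a", "a", "a", "b", "c", None],
)
print(report["output_rows"])  # 7  (2*3 for "a" plus 1*1 for "b")
```

A large `output_rows` relative to either input, or one key dominating `worst_keys`, suggests the join is multiplying rows; high null-key counts point to the 'null join' case.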
Check the size of files in your input datasets (Dataset → Details → Files → Dataset Files). They should be at least 128MB each. If they're too small, or much too large, you'll need to repartition them.
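As a rough illustration of the repartitioning arithmetic, the sketch below picks a partition count that keeps output files at or above a minimum size. The 128 MB default comes from the guidance above; the function name and the assumption that data spreads evenly across partitions are illustrative. The result is the kind of number that would be passed to PySpark's `DataFrame.repartition`.

```python
MB = 1024 * 1024


def target_partition_count(total_bytes, min_file_bytes=128 * MB):
    """Largest partition count that keeps the average file >= min_file_bytes.

    Hypothetical helper: assumes rows are spread evenly across partitions,
    which real data only approximates.
    """
    if total_bytes < 0 or min_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and min_file_bytes must be > 0")
    # Floor division: rounding up would push the average file below the minimum.
    return max(1, total_bytes // min_file_bytes)


# Example: a 10 GB dataset currently stored as many tiny files.
n = target_partition_count(10 * 1024 * MB)
print(n)  # 80
# In a PySpark transform this could then be applied as: df = df.repartition(n)
```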
Split the transform into multiple smaller transforms. This can also help you identify which part of the transform is causing the failure.
Remove columns you don't need from the input datasets, or pre-filter the datasets to drop rows you don't need. Both reduce the amount of data Spark has to hold in memory.
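A minimal sketch of trimming an input before any expensive step. It relies only on the `filter` and `select` methods of a PySpark DataFrame; the function name, the column names, and the filter condition are placeholders chosen for illustration.

```python
def prune_input(df, keep_columns, row_filter=None):
    """Drop unneeded rows and columns as early as possible.

    df           -- a PySpark DataFrame (any object with filter/select works)
    keep_columns -- names of the columns the rest of the transform uses
    row_filter   -- optional SQL expression string, e.g. "amount > 0"
    """
    if row_filter is not None:
        # Filter first so the condition can reference columns dropped below.
        df = df.filter(row_filter)
    return df.select(*keep_columns)
```

In a transform this might be called as `orders = prune_input(orders, ["order_id", "amount"], "amount > 0")` before any joins; the dataset and column names here are hypothetical.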
If you can, simplify the logic of your transform.
If code optimization is not enough for your job to build successfully, you can allocate additional resources. Note that additional resources can increase the compute costs associated with the build. For more information on allocating additional resources, see:
When to modify your Spark profile from the default: Guidance around which Spark profiles to increase, based on your error.
Spark profiles reference: List of Spark profiles available in Foundry.
Apply Spark profiles: Instructions for how to apply custom Spark properties to your jobs.
