Code Repositories
How do I enable a code linter in my repository?
Uncomment the lint-related lines in transforms-python/build.gradle in your code repository. This enables a linting task that flags violations of pep8 or pylint rules.
Timestamp: June 12, 2024
How can I auto-scale Spark executors based on transform input size?
Enable dynamic allocation, which scales the number of executors automatically but does not scale executor or driver memory. Profiles such as DYNAMIC_ALLOCATION_MAX_64 and DYNAMIC_ALLOCATION_ENABLED support this. The Spark profiles reference documentation has more information and lists the profiles with built-in dynamic allocation configurations.
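For orientation, open-source Spark controls dynamic allocation through properties such as `spark.dynamicAllocation.enabled` and `spark.dynamicAllocation.maxExecutors`, while memory is set by separate static properties. The sketch below assumes, without confirmation from this FAQ, that a profile like DYNAMIC_ALLOCATION_MAX_64 corresponds to settings of this kind. The helper name `dynamic_allocation_conf` is hypothetical, and the Spark profiles reference remains the authority on what each profile sets.

```python
def dynamic_allocation_conf(max_executors):
    """Spark properties that let the executor count scale up to a ceiling.

    Assumption: illustrative values only, not the contents of any Foundry profile.
    """
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


# Memory is governed by separate, static Spark properties.
# Dynamic allocation changes how many executors run, not how large they are.
STATIC_MEMORY_KEYS = ("spark.executor.memory", "spark.driver.memory")

conf = dynamic_allocation_conf(64)
```

If a transform fails because individual executors run out of memory, raising the executor ceiling will not help. That case calls for a memory-oriented profile instead.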
Timestamp: April 5, 2024
How do I enable an auto formatter in my Python code repository?
Selecting the Format before committing option when you commit code runs the formatCode task, which can use either ruff or black as the formatter. To choose one, uncomment the corresponding formatter lines in the transforms-python/build.gradle file.
Timestamp: June 12, 2024
How do I troubleshoot an import error stating No module named <module-name>; <package-name> is not a package in a transform?
To troubleshoot and resolve the import error:
- Verify that the library is correctly installed in your code repository.
- Check for hidden files and confirm that the environment configuration is set up properly.
- Review the package resolution logs and resolve any package conflicts.
- If necessary, make a new commit to re-trigger environment resolution.
- If the module previously worked, compare library versions for major breaking changes.
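One common cause of the "is not a package" variant of this error is that the top-level name resolves to a single module file (for example, a local file that shadows the installed library) rather than to a package directory. The sketch below uses only the Python standard library to report how a name resolves and which version of a distribution is installed. The helper names `diagnose_import` and `installed_version` are hypothetical, and this is a general diagnostic aid rather than a Foundry-specific procedure.

```python
import importlib.util
from importlib import metadata


def diagnose_import(top_level_name):
    """Report how Python resolves a top-level module name (no dots)."""
    spec = importlib.util.find_spec(top_level_name)
    if spec is None:
        return {"found": False, "origin": None, "is_package": False}
    return {
        "found": True,
        # File (or marker such as "built-in") that the name resolves to.
        "origin": spec.origin,
        # Only packages have submodule search locations; a plain module
        # file has none, which is what triggers "is not a package".
        "is_package": spec.submodule_search_locations is not None,
    }


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None
```

If `diagnose_import` reports `is_package` as false and `origin` points at a file inside your repository, renaming that file is a likely fix. Comparing `installed_version` output before and after a change helps with the last step in the list.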
Timestamp: April 25, 2024
In transforms, what is the correct method to write a pandas dataframe?
Use the .write_pandas() method. An AttributeError: 'DataFrame' object has no attribute '_jdf' means you are calling a method designed for pyspark dataframes on a pandas dataframe.
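A minimal sketch of the pattern, assuming the `transforms.api` decorators and the `pandas()` and `write_pandas()` methods on transform inputs and outputs. The decorator is shown in comments with hypothetical dataset paths, so the function body can be read and exercised with stand-in objects outside Foundry.

```python
# Inside a Foundry Python repository this function would be registered with
# the transforms decorator, for example (dataset paths are hypothetical):
#
#   from transforms.api import transform, Input, Output
#
#   @transform(
#       out=Output("/Project/folder/output_dataset"),
#       source=Input("/Project/folder/input_dataset"),
#   )
def compute(out, source):
    # Read the input as a pandas dataframe rather than a pyspark dataframe.
    pandas_df = source.pandas()

    # ... pandas-only logic would go here ...

    # Write with write_pandas(). Passing a pandas dataframe to a
    # pyspark-oriented writer such as write_dataframe() is the kind of call
    # that produces "'DataFrame' object has no attribute '_jdf'".
    out.write_pandas(pandas_df)
```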
Timestamp: May 30, 2024
Can I schedule a transform that has no output dataset in order to trigger a Python script that calls an external API?
No, it is not possible to set a schedule on a transform without an output dataset. The recommended solution is to track the response from the external API in an output dataset for logging. Alternatively, a function triggered via Automate could be used to run arbitrary code, but having an output dataset is still valuable for logging.
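A sketch of the logging approach, assuming the API call yields an HTTP status code and a text body. The helper `build_log_row` and its column names are hypothetical. The idea is that each scheduled run records one row per API call in the output dataset.

```python
from datetime import datetime, timezone


def build_log_row(status_code, body, called_at=None, max_body_chars=1000):
    """Turn one API response into a flat record suitable for a log dataset."""
    if called_at is None:
        called_at = datetime.now(timezone.utc)
    status = int(status_code)
    return {
        "called_at": called_at.isoformat(),
        "status_code": status,
        "succeeded": 200 <= status < 300,
        # Truncate large payloads so the log dataset stays small.
        "response_body": body[:max_body_chars],
    }


# Example: one row per API call made during a scheduled run.
rows = [build_log_row(200, '{"result": "ok"}')]
```

Rows like these could be collected into a pandas dataframe and written to the output with .write_pandas(), which gives the schedule an output dataset to build.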
Timestamp: February 6, 2025