Debug Python transforms¶
This guide provides an overview of debugging techniques available in Python transforms. More information on errors and exceptions can be found in the Python documentation ↗.
Using the debugger¶
A debugger is a useful tool for identifying and resolving issues in your Python transforms. You can set breakpoints to pause transform execution and examine variables, view DataFrames, and understand functions and libraries.
Learn more about the debugger for your chosen IDE:
Reading Python tracebacks¶
A traceback in Python is an error message that contains the sequence of function calls that led to an error; other programming languages call this a stack trace. Any unhandled exception results in a traceback, with the most recent call at the bottom.
Most runtime failures in Python transforms surface as tracebacks, so it is important to understand how to read them.
Consider the following code example:
class Stats(object):
    nums = []

    def add(self, n):
        self.nums.append(n)

    def sum(self):
        return sum(self.nums)

    def avg(self):
        return self.sum() / len(self.nums)


def main():
    stats = Statistics()
    stats.add(1)
    stats.add(2)
    stats.add(3)
    print(stats.avg())


main()
Running this code results in the following traceback:
Traceback (most recent call last):
  File "test.py", line 26, in <module>
    main()
  File "test.py", line 16, in main
    stats = Statistics()
NameError: global name 'Statistics' is not defined
Python tracebacks show the most recent call last, whereas stack traces in many other programming languages list it first. Reading from the bottom up, the traceback shows the following:
- The exception name, `NameError` ↗. There are many built-in Python exception classes ↗, but code can also define its own exception classes.
- The exception message, `global name 'Statistics' is not defined`. This message usually contains the most useful information for debugging.
- The sequence of function calls leading up to the exception. Each entry, such as `File "test.py", line 26, in <module>`, gives the file, line number, and enclosing scope, followed by the line of code itself. The last entry (line 16, in `main`) is the line that raised the exception.
Using this traceback, we can see that the exception occurs at line 16 of `test.py`, in the `main` function. The line of code causing the error is `stats = Statistics()`, and the exception raised is `NameError`, which means that the name `Statistics` is not defined. Looking back at the example code, the class is named `Stats`, so `Stats` should have been used instead of `Statistics`.
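To make the bottom-up reading order concrete, the following minimal sketch reproduces the same kind of `NameError` and uses the standard-library `traceback` module to inspect it. The `describe_failure` helper is a hypothetical name introduced only for illustration; it is not part of any transforms API, and the exact wording of the exception message can differ between Python versions.

```python
import traceback


class Stats(object):
    def __init__(self):
        self.nums = []

    def add(self, n):
        self.nums.append(n)


def main():
    # Deliberate bug, as in the example above: the class is named Stats.
    stats = Statistics()
    stats.add(1)


def describe_failure(func):
    """Hypothetical helper: run func and summarize any exception it raises."""
    try:
        func()
    except Exception as exc:
        # Frames are ordered oldest first, so the most recent call is last.
        frames = traceback.extract_tb(exc.__traceback__)
        innermost = frames[-1]
        return {
            "exception": type(exc).__name__,
            "message": str(exc),
            "function": innermost.name,
            "lineno": innermost.lineno,
            "text": traceback.format_exc(),
        }
    return None


summary = describe_failure(main)
print(summary["exception"])  # NameError
print(summary["function"])   # main
print(summary["text"])       # full traceback, most recent call last
```

The last entry of the frame list corresponds to the bottom of the printed traceback, which is why reading from the bottom up leads to the failing line first.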
Logging¶
For logging, you can use the following options:
- Simple `print` statements. This method is supported for standard (lightweight) transforms, but not for Spark transforms.
- The standard Python logging module ↗. This method is supported for all transform types. Note that only `INFO`-level logs and higher are saved.
Logs are available in VS Code under Output and in the Builds application under Actions > View logs.
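As a rough illustration of the level threshold described above, the sketch below uses only the standard `logging` module with an in-memory handler. The logger name, the handler, and the `INFO` threshold set here are assumptions made for a local demonstration; they imitate the documented behavior and are not the platform's own logging configuration.

```python
import io
import logging

# In-memory stream standing in for wherever logs are stored (assumption).
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

log = logging.getLogger("example.transform")  # hypothetical logger name
log.setLevel(logging.INFO)  # mimic a threshold of INFO and higher
log.propagate = False       # keep the demo output out of the root logger
log.addHandler(handler)

log.debug("Intermediate value: %s", 42)   # below INFO, so it is dropped
log.info("Number of rows: %d", 3)         # kept
log.warning("Column %s has nulls", "id")  # kept

captured = stream.getvalue().splitlines()
print(captured)  # ['INFO Number of rows: 3', 'WARNING Column id has nulls']
```

If a message you expect is missing from the logs, checking whether it was emitted below `INFO` is a reasonable first step.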
The following code example demonstrates how you can output logs to help with debugging:
```python tab="Polars"
from transforms.api import transform, Input, Output
import polars as pl
import logging

log = logging.getLogger(__name__)


@transform.using(
    output=Output("/path/output"),
    input=Input("/path/input"),
)
def my_compute_function(output, input):
    input_df = input.polars()
    # Logged through the standard logging module at INFO level.
    log.info("Number of rows: %d", input_df.height)
    # print also works here because this is a lightweight transform.
    print("Number of columns: " + str(input_df.width))
    output.write_table(input_df)
```

```python tab="DuckDB"
from transforms.api import transform, Input, Output
import polars as pl
import logging

log = logging.getLogger(__name__)


@transform.using(
    output=Output("/path/output"),
    input=Input("/path/input"),
)
def my_compute_function(ctx, output, input):
    conn = ctx.duckdb().conn
    # Materialize the query result as a DataFrame so its shape can be logged.
    input_df = conn.sql("SELECT * FROM input").fetchdf()
    log.info("Number of rows: %d", input_df.shape[0])
    print("Number of columns: " + str(input_df.shape[1]))
    output.write_table(input_df)
```

```python tab="Pandas"
from transforms.api import transform, Input, Output
import pandas as pd
import logging

log = logging.getLogger(__name__)


@transform.using(
    output=Output("/path/output"),
    input=Input("/path/input"),
)
def my_compute_function(output, input):
    input_df = input.pandas()
    log.info("Number of rows: %d", input_df.shape[0])
    print("Number of columns: " + str(input_df.shape[1]))
    output.write_table(input_df)
```

```python tab="PySpark"
from transforms.api import transform_df, Input, Output
from myproject.datasets import utils
import logging

log = logging.getLogger(__name__)


@transform_df(
    Output("/path/output"),
    input=Input("/path/input"),
)
def my_compute_function(input):
    # Spark transforms do not support print, so both values go through logging.
    log.info("Number of rows: %d", input.count())
    log.info("Number of columns: %d", len(input.columns))
    return input
```

:::callout{theme="neutral"}
You can find additional information about working with Spark logs in the Spark documentation.
:::
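One pattern that combines logging with tracebacks is to record the traceback in the logs before re-raising the exception, using the standard `Logger.exception` method. The sketch below is a plain-Python illustration under stated assumptions: the logger name, the in-memory stream, and the `average` and `safe_average` helpers are hypothetical and not part of any transforms API.

```python
import io
import logging

# In-memory stream so the logged text can be inspected (assumption).
stream = io.StringIO()
log = logging.getLogger("example.debugging")  # hypothetical logger name
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(logging.StreamHandler(stream))


def average(nums):
    """Hypothetical helper that fails on an empty list."""
    return sum(nums) / len(nums)


def safe_average(nums):
    try:
        return average(nums)
    except ZeroDivisionError:
        # Logger.exception logs at ERROR level and appends the current traceback.
        log.exception("Could not average %d values", len(nums))
        raise


try:
    safe_average([])
except ZeroDivisionError:
    pass  # the failure has already been recorded in the log

logged = stream.getvalue()
print(logged)
```

Because `ERROR` is above `INFO`, a message logged this way falls within the levels that the section above says are saved.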