Languages¶

Supported languages and versions¶

Code Workbook currently supports three languages: Python, R, and SQL.

Code Workbook currently supports Python 3.10 and Python 3.12. Lower versions of Python are not supported, and environments using them will fail to resolve. We strongly recommend using the latest available version of Python, as Palantir Foundry discontinues support for Python versions that the Python Developer documentation considers end-of-life ↗.
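
As a small sketch, the check below compares an interpreter version against the Python versions listed above. The names `SUPPORTED_PYTHON_VERSIONS` and `is_supported_python` are hypothetical and introduced only for illustration, and the version set is copied from this page, so it may go out of date.

```python
import sys

# Assumption: copied from the versions listed on this page; update as support changes.
SUPPORTED_PYTHON_VERSIONS = {(3, 10), (3, 12)}


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version is in the supported set."""
    if version_info is None:
        # Default to the interpreter that is running this code.
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_PYTHON_VERSIONS
```

Calling `is_supported_python()` with no arguments checks the running interpreter, which can be a quick sanity check inside a transform when an environment behaves unexpectedly.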

The currently supported versions of R include R 3.6, R 4.0, R 4.1 and R 4.2. The versions R 3.3, R 3.4, and R 3.5 are not supported, and their respective environments will fail to initialize.

The SQL variant supported in Code Workbook is Spark SQL ↗.

To enable a specific language on a Code Workbook profile, see the Conda Environment section of the Configuring Code Workbook profiles documentation. Examples for each supported language are provided below, in the respective introductions to Python, R, and SQL.

Enable languages in a workbook¶

Specific configuration is necessary for supported languages to function, as discussed in the sections below.

Enable R¶

:::callout{theme="warning"} R is not yet available for self-service. :::

Two things must be true in order to have the ability to create an R transform in Code Workbook:

The R language must be specifically enabled on your enrollment. Contact your Palantir representative for assistance with enablement.
The package vector-spark-module-r must be present in the environment currently used in the workbook. This can be achieved in either of the following ways:
Toggle the R checkbox in the profile's configuration in Control Panel. This will automatically add the vector-spark-module-r package to the profile's environment.
Manually add vector-spark-module-r to the environment using the Add package dropdown menu.

See Configuring Code Workbook profiles for more information.

Enable Python¶

The package vector-spark-module-py must be present in the environment currently in use in the workbook. This can be achieved in either of the following ways:

Toggle the Python checkbox in the profile's configuration in Control Panel. This will automatically add the vector-spark-module-py package to the environment.
Manually add vector-spark-module-py to the environment using the Add package dropdown menu.

See Configuring Code Workbook profiles for more information.

Enable SQL¶

SQL transforms do not require any additional packages to function. As a result, SQL transforms will always be available by default for any given profile.

:::callout{theme="success"} If you do not plan to use Python or R on a given profile, consider removing the associated vector-spark-module package to reduce the size of your environment. You can always add it back when you need it. :::

Introduction to Python¶

Python transforms¶

A Python transform is defined as a Python function with any number of inputs, at most one output, and optionally one or more visualizations. When you reference another transform's alias as a function argument, Code Workbook automatically passes that transform's output in as the input. For more information about transforms in Code Workbook, consult the Transforms overview documentation.

A simple Python transform might take a single PySpark DataFrame as input, transform the data using PySpark syntax, and return the transformed Spark DataFrame as output.

def child(input_spark_dataframe):
    from pyspark.sql import functions as F
    return input_spark_dataframe.filter(F.col('A') == 'value').withColumn('new_column', F.lit(2))
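
Because a transform can take any number of inputs, a transform with two parents is a function with two arguments. The sketch below assumes two hypothetical upstream aliases, `orders` and `customers`, whose outputs are Spark DataFrames that share a `customer_id` column. None of these names come from Code Workbook itself.

```python
def orders_with_customers(orders, customers):
    # `orders` and `customers` are hypothetical aliases of upstream transforms;
    # Code Workbook passes their outputs in as Spark DataFrames.
    # A left join keeps every order and attaches matching customer columns, if any.
    return orders.join(customers, on="customer_id", how="left")
```

Joining on a column name rather than an expression keeps a single `customer_id` column in the output instead of one from each side.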

Conversion between Spark and pandas in Python¶

Within a Python transform, converting between Spark and pandas DataFrames is straightforward.

# Convert to PySpark
spark_df = spark.createDataFrame(pandas_df)
# Convert to pandas
pandas_df = spark_df.toPandas()

Converting to pandas means collecting data to the driver. As a result, the size of the data is constrained by the available driver memory on the Spark module. If you are working with a large dataset, you may want to first filter and aggregate your data using Spark, then collect it into a pandas DataFrame.

from pyspark.sql import functions as F
def filtering_before_pandas(input_spark_dataframe):
    # Use PySpark to filter the data
    filtered_spark_df = input_spark_dataframe.select("name", "age").filter(F.col("age") <= 18)

    # Convert to a pandas df, which collects the data to the driver
    pandas_df = filtered_spark_df.toPandas()

    # Perform pandas operations
    mean_age = pandas_df["age"].mean()
    pandas_df["age_difference_to_mean"] = pandas_df["age"] - mean_age

    # Output the resulting DataFrame
    return pandas_df
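
Since everything collected must fit in driver memory, one option is to refuse to collect when the input is larger than expected. The sketch below is one possible guard, not a Code Workbook feature: `to_pandas_if_small` and its `max_rows` threshold are hypothetical, and row count is only a rough proxy for memory use.

```python
def to_pandas_if_small(spark_df, max_rows=100_000):
    # `max_rows` is an illustrative threshold, not a Code Workbook setting.
    # limit(max_rows + 1) means Spark counts at most one row past the threshold.
    row_count = spark_df.limit(max_rows + 1).count()
    if row_count > max_rows:
        raise ValueError(
            f"More than {max_rows} rows; filter or aggregate in Spark "
            "before calling toPandas()."
        )
    # Small enough: collect the data to the driver as a pandas DataFrame.
    return spark_df.toPandas()
```

The count runs as an extra Spark job before the collect, so this trades some runtime for an earlier and clearer failure.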

:::callout{theme="neutral"} To keep the order of a sorted pandas DataFrame after saving, save it as a Spark DataFrame with a single partition:

import pyspark.pandas as p

return p.from_pandas(df).to_spark().coalesce(1)

:::

Introduction to R¶

R transforms¶

An R transform is defined as an R function with any number of inputs, at most one output, and optionally one or more visualizations. When you reference another transform's alias as a function argument, Code Workbook automatically passes that transform's output in as the input. For more information about transforms in Code Workbook, consult the Transforms overview documentation.

A simple R transform might take one parent R data.frame as input, transform the data using R, and return one R data.frame as output.

child <- function(r_dataframe) {
    library(tidyverse)
    # Keep two columns, retain rows where col_A is TRUE, and add a constant column
    new_df <- r_dataframe %>%
        dplyr::select(col_A, col_B) %>%
        dplyr::filter(col_A == TRUE) %>%
        dplyr::mutate(new_column = 1000)
    return(new_df)
}

Conversion between Spark and R dataframes¶

Within an R transform, converting between Spark DataFrames and R data.frame is straightforward:

# Convert from Spark DataFrame to R data.frame
new_r_df <- SparkR::collect(spark_df)

# Convert from R data.frame to Spark DataFrame
spark_df <- SparkR::as.DataFrame(r_df)

Note that converting to an R data.frame means collecting data to the driver. As a result, the size of the data is constrained by the available driver memory on the Spark module. If you are working with a large dataset, you may want to first filter and aggregate your data using Spark, then collect it into an R data.frame.

output_dataset <- function(spark_df) {
    library(tidyverse)

    # Use SparkR to select only the columns you need
    input_dataset_filtered <- SparkR::select(spark_df, 'column_A', 'column_B')

    # Convert to R data.frame, which collects the data to the driver
    local_df <- SparkR::collect(input_dataset_filtered)

    # Use tidyverse functions to transform your data
    local_df <- local_df %>% dplyr::filter(column_A == TRUE) %>% dplyr::mutate(new_column = 1000)

    # Output an R data.frame
    return(local_df)
}

R troubleshooting¶

When an imported dataset in Code Workbook is read in as an R data.frame, the dataset is converted from a Spark DataFrame to an R data.frame by collecting to the driver.

You will not be able to read in arbitrarily large data as an R data.frame. The size of the data is constrained by driver memory on the Spark module. To work with large data, consider first reading in your dataset as a Spark DataFrame, using SparkR to reduce the data to something smaller, and then calling SparkR::collect() to convert it to an R data.frame. Alternatively, use Python or SQL to reduce your data prior to using R.
There are some known issues when collecting certain data types to an R data.frame. In most cases, data is collected using a library called r-arrow ↗, which speeds up serialization and deserialization. However, r-arrow cannot convert the Long, Array, Map, Struct, and Datetime types. Consider dropping these columns or casting them to other data types (such as String). You will receive a warning in the interface when attempting to read an input with these types as an R data.frame.
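
One way to apply the casting advice is an upstream Python transform that converts the affected columns to strings before the R transform reads them. The sketch below is an illustration under assumptions: the helper names are hypothetical, and the mapping from the types listed above to the type names reported by `DataFrame.dtypes` (Long as `bigint`, Datetime as `timestamp`) is an assumption that should be checked against your data.

```python
# Assumption: type-name prefixes, as reported by DataFrame.dtypes, for the
# types listed above (Long -> bigint, Datetime -> timestamp).
UNSUPPORTED_PREFIXES = ("bigint", "array", "map", "struct", "timestamp")


def string_cast_expressions(dtypes):
    """Build selectExpr() expressions that cast unsupported columns to string."""
    expressions = []
    for name, type_name in dtypes:
        if type_name.startswith(UNSUPPORTED_PREFIXES):
            # Cast in place, keeping the original column name.
            expressions.append(f"CAST(`{name}` AS STRING) AS `{name}`")
        else:
            # Pass supported columns through unchanged.
            expressions.append(f"`{name}`")
    return expressions


def prepare_for_r(spark_df):
    # Hypothetical Python transform placed upstream of the R transform.
    return spark_df.selectExpr(*string_cast_expressions(spark_df.dtypes))
```

Casting a Long column to a string loses its numeric type in R, so for columns you need to compute on, casting to a double or integer type instead may be a better fit.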

R in Code Workbook is single-threaded, meaning only one R job can be run at a time on the same Spark module. If you initiate multiple R jobs at the same time, they will run serially; jobs that are queued will appear as "Queueing in Code Workbook."

If you have a long-running job or jobs where the transforms are saved as datasets, we recommend that you run a batch build. The batch build will run on its own Spark module, allowing you to continue iterating in the same workbook or other workbooks that share the Spark module.

Introduction to SQL¶

The SQL variant supported in Code Workbook is Spark SQL ↗. The only supported input and output types are Spark DataFrames.

A simple example of a SQL transform could join two input DataFrames on a join key.

SELECT table_b.col_A, table_b.col_B, table_a.*
FROM table_a
JOIN table_b ON table_a.col_C = table_b.col_C

To add a parent to a SQL node, referencing the alias within the code is not sufficient. You must use the UI by selecting the input bar, or create the child node using the + button.
