Apply Spark profiles¶
You may want to apply custom Spark properties to your transforms jobs.
To apply the Spark properties to a specific job:
- Follow the guide for importing the Spark profile into your repository.
- Reference the transforms profile in your code as documented below.
You can learn more about the characteristics of the default Spark profiles available in the Spark Profiles Reference section.
Note also the recommended best practices for adjusting Spark profiles.
Transforms profile syntax¶
Specifying custom Spark profiles is supported in all languages. In all of the cases below, settings are evaluated from left to right. If multiple profiles specify the same setting, the one closer to the end of the list will take precedence.
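The precedence rule above can be illustrated with a small, self-contained sketch. It is not how the platform resolves profiles internally; it only models "evaluate left to right, later wins". The `PROFILES` mapping, its property values, and the `resolve_settings` helper are hypothetical and exist only for this illustration.

```python
from typing import Dict, List

# Hypothetical profile contents, for illustration only. The real properties
# behind a profile name are defined wherever the profile itself is defined.
PROFILES: Dict[str, Dict[str, str]] = {
    "profile1": {"spark.executor.memory": "4g", "spark.executor.cores": "2"},
    "profile2": {"spark.executor.memory": "8g"},
}


def resolve_settings(profile_names: List[str]) -> Dict[str, str]:
    """Merge profiles left to right; a later profile overrides an earlier one."""
    resolved: Dict[str, str] = {}
    for name in profile_names:
        # dict.update overwrites keys that were already set by earlier profiles.
        resolved.update(PROFILES[name])
    return resolved


if __name__ == "__main__":
    # profile2 is listed last, so its spark.executor.memory value takes precedence.
    print(resolve_settings(["profile1", "profile2"]))
```

Under these assumed values, listing `profile2` last yields `spark.executor.memory` of `8g`, while `spark.executor.cores` keeps the value from `profile1` because no later profile sets it. Reversing the list would leave memory at `4g`.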
Python¶
You can reference the profile1 and profile2 profiles in your Python code by chaining with_config onto the decorator that wraps your transform, as in the example below. Its profiles parameter takes the list of your custom transforms profiles:
from transforms.api import transform, Input, Output

@transform.spark.using(
    # your input dataset(s)
    my_input=Input("/path/to/input/dataset"),
    # your output dataset
    my_output=Output("/path/to/output/dataset"),
).with_config(profiles=['profile1', 'profile2'])
# your data transformation code
def my_compute_function(my_input, my_output):
    return my_output.write_dataframe(my_input.dataframe())
Java¶
Auto-registered transforms can reference the profile1 and profile2 profiles in your Java code by placing the TransformProfiles annotation on your compute function. This annotation takes an array of your custom Spark profile names:
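import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import com.palantir.transforms.lang.java.api.TransformProfiles;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

/**
 * This is an example low-level Transform intended for automatic registration.
 */
public final class LowLevel {

    @Compute
    @TransformProfiles({ "profile1", "profile2" })
    public void myComputeFunction(
            @Input("/path/to/input/dataset") FoundryInput myInput,
            @Output("/path/to/output/dataset") FoundryOutput myOutput) {
        // Read the input as a DataFrame and keep only the first 10 rows.
        Dataset<Row> limited = myInput.asDataFrame().read().limit(10);
        myOutput.getDataFrameWriter(limited).write();
    }
}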
Alternatively, if you are using manual registration, you can use the builder method transformProfiles():
public final class MyPipelineDefiner implements PipelineDefiner {

    @Override
    public void define(Pipeline pipeline) {
        LowLevelTransform lowLevelManualTransform = LowLevelTransform.builder()
                .transformProfiles(ImmutableList.of("profile1", "profile2"))
                // Pass in the compute function to use. Here, "LowLevelManualFunction" corresponds
                // to the class name for a compute function for a low-level Transform.
                .computeFunctionInstance(new LowLevelManualFunction())
                .putParameterToInputAlias("myInput", "/path/to/input/dataset")
                .putParameterToOutputAlias("myOutput", "/path/to/output/dataset")
                .build();
        pipeline.register(lowLevelManualTransform);
    }
}
SQL¶
You can reference the profile1 and profile2 profiles in your SQL code by setting the foundry_transform_profiles property for your table:
CREATE TABLE `/path/to/output` TBLPROPERTIES (foundry_transform_profiles = 'profile1, profile2')
AS SELECT * FROM `/path/to/input`
Here is another example using alternative SQL syntax:
CREATE TABLE `/path/to/output` USING foo_bar OPTIONS (foundry_transform_profiles = 'profile1, profile2')
AS SELECT * FROM `/path/to/input`;
Note that specifying custom transforms profiles is not currently supported in ANSI SQL mode.
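In the SQL examples above, the profile list is written as a single comma-separated string rather than as a list or array. As a minimal sketch of how an ordered list of profile names maps onto that string, the Python helpers below render a statement in the TBLPROPERTIES form shown earlier. The function names `profiles_property` and `create_table_sql` are hypothetical and are not part of any Foundry API; the sketch only performs string formatting.

```python
from typing import List


def profiles_property(profile_names: List[str]) -> str:
    """Join profile names, preserving their order, into one comma-separated value."""
    return ", ".join(profile_names)


def create_table_sql(output_path: str, input_path: str, profile_names: List[str]) -> str:
    """Render a CREATE TABLE statement using the TBLPROPERTIES form shown above."""
    return (
        f"CREATE TABLE `{output_path}` "
        f"TBLPROPERTIES (foundry_transform_profiles = '{profiles_property(profile_names)}')\n"
        f"AS SELECT * FROM `{input_path}`"
    )


if __name__ == "__main__":
    print(create_table_sql("/path/to/output", "/path/to/input", ["profile1", "profile2"]))
```

Because order determines precedence, the helper keeps the names in the order given instead of sorting or de-duplicating them.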