
Getting started

This page explains the structure and configuration options available when using the Palantir R SDK to write data transformations in Foundry.

Repository structure

Each R transform is defined and configured by a transform spec, written in YAML and saved in the .transforms folder. The full contents of the repository are available at runtime and can be used by the transform.

.
├── .transforms
|    └── happiness_ranking.yml
|
├── project
|    └── src
|        └── top_10.R
└── .Rprofile

For R transforms published from Code Workspaces, transform spec files are rendered and configurable via the UI.

Write and configure a simple R transform

In the example below, we write a simple data transformation that reads a tabular dataset, applies a filter using the R package dplyr, and produces an output dataset with the filtered result.

library(foundry)
library(dplyr)

happiness_2019 <- datasets.read_table("happiness_2019")
top_10 <- happiness_2019 %>% filter(Overall_rank <= 10)
datasets.write_table(top_10, "top_10_happiest_countries")
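
Because the read and write calls above presumably only resolve aliases inside Foundry, it can help to keep the filtering step in a plain function and try it on a small local data frame first. The sketch below is one way to do that. The function keep_top_ranked and the sample rows are hypothetical, and base R subsetting stands in for the dplyr filter only so that the sketch has no package dependencies.

```r
# Hypothetical helper: the filtering step as a plain function.
# which() drops rows where the comparison is NA, as dplyr::filter() does.
keep_top_ranked <- function(df, max_rank = 10) {
  df[which(df$Overall_rank <= max_rank), , drop = FALSE]
}

# Made-up rows for illustration only; not real happiness data.
sample_rows <- data.frame(
  Country = paste("Country", 1:12),
  Overall_rank = 12:1,
  stringsAsFactors = FALSE
)

top_10 <- keep_top_ranked(sample_rows)

# Basic sanity checks on the result.
stopifnot(nrow(top_10) == 10, all(top_10$Overall_rank <= 10))
```

Inside the transform, the same function could be applied to the data frame returned by datasets.read_table before the result is passed to datasets.write_table.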

The corresponding transform spec file, shown below, defines the transform.

Replace each dataset.rid field with the RID of your input or output dataset. Output datasets must be created manually beforehand, because creating new output datasets at checks time is not currently supported.

You can add more inputs and outputs by adding items to the inputs or outputs list, as the second-output entry does here:

inputs:
- alias: "happiness_2019"
  properties:
    type: "dataset"
    dataset:
      rid: "<input dataset rid here>"
outputs:
- alias: "top_10_happiest_countries"
  properties:
    type: "dataset"
    dataset:
      rid: "<output dataset rid here>"
- alias: "second-output"
  properties:
    type: "dataset"
    dataset:
      rid: "<second output dataset rid here>"
runtime:
  type: "rscript"
  rscript:
    identifier: "R.4.1.x"
    filePath: "project/src/top_10.R"
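
The alias values in the spec appear to be the names that the R script passes to datasets.read_table and datasets.write_table, so a mismatch between the two files is an easy mistake to make. The following is a rough sketch of a local consistency check, not an SDK feature. The helper find_aliases and the hand-copied alias vectors are hypothetical, and the regular expressions only handle simple calls with a literal string alias.

```r
# Hypothetical helper: pull the quoted alias out of each line matching the pattern.
find_aliases <- function(script_lines, pattern) {
  hits <- regmatches(script_lines, regexec(pattern, script_lines))
  unlist(lapply(hits, function(m) if (length(m) == 2) m[2] else NULL))
}

# The transform script, line by line (readLines() could load it from disk instead).
script_lines <- c(
  'happiness_2019 <- datasets.read_table("happiness_2019")',
  'top_10 <- happiness_2019 %>% filter(Overall_rank <= 10)',
  'datasets.write_table(top_10, "top_10_happiest_countries")'
)

# Aliases copied by hand from the inputs and outputs lists of the spec.
spec_inputs  <- c("happiness_2019")
spec_outputs <- c("top_10_happiest_countries", "second-output")

read_pattern  <- 'datasets\\.read_table\\("([^"]+)"'
write_pattern <- 'datasets\\.write_table\\([^"]*"([^"]+)"'

used_inputs  <- find_aliases(script_lines, read_pattern)
used_outputs <- find_aliases(script_lines, write_pattern)

# Aliases used in the script but missing from the spec.
undeclared <- c(setdiff(used_inputs, spec_inputs),
                setdiff(used_outputs, spec_outputs))
# Outputs declared in the spec that the script never writes.
unwritten <- setdiff(spec_outputs, used_outputs)
```

For the script and spec shown on this page, undeclared is empty and unwritten contains second-output, which the example script does not write.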

Runtime configurations:

  • identifier: Identifies the runtime for the transform; for example, R.4.1.x refers to the R 4.1 minor release.
  • filePath: The relative file path from the root of the project to the R script file that will be executed.
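
When developing locally, it may be useful to know whether the local R version lines up with the identifier in the spec. The sketch below assumes that an identifier such as R.4.1.x means any patch release of R 4.1; that reading, and the helper matches_identifier, are assumptions made for illustration rather than documented behavior.

```r
# Assumption: "R.4.1.x" is read as "any patch release of R 4.1".
matches_identifier <- function(identifier,
                               version = paste(R.version$major, R.version$minor, sep = ".")) {
  # "R.4.1.x" -> "4.1"
  wanted <- sub("\\.x$", "", sub("^R\\.", "", identifier))
  # "4.1.3" -> "4.1"
  running <- sub("^([0-9]+\\.[0-9]+).*$", "\\1", version)
  identical(wanted, running)
}

# Compare the running R session against the identifier from the spec.
if (!matches_identifier("R.4.1.x")) {
  message("Local R is ", R.version.string, "; the spec asks for R.4.1.x.")
}
```

The version argument defaults to the running session, and passing a string such as "4.1.3" makes the helper easy to check in isolation.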

Write unstructured data

In this example, we use ggplot2 to generate a plot with a linear regression line and save the plot as a PNG to an output dataset.

library(foundry)
library(ggplot2)

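happiness_2019 <- datasets.read_table("happiness_2019")

# Open a PNG graphics device; everything drawn until dev.off() goes to this file.
png("/tmp/model_plot.png")
plot <- ggplot(happiness_2019, aes(GDP_per_capita, Overall_rank)) + geom_point()
# Print explicitly so the plot is drawn regardless of how the script is executed (for example, via source()).
print(plot + stat_smooth(method = "lm", formula = y ~ x, geom = "smooth"))
dev.off()

# Upload the PNG file to the output dataset.
datasets.upload_files("/tmp/model_plot.png", "happinessGDP_plot")

The corresponding transform spec file:

inputs: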
- alias: "happiness_2019"
  properties:
    type: "dataset"
    dataset:
      rid: "<input dataset rid here>"
outputs:
- alias: "happinessGDP_plot"
  properties:
    type: "dataset"
    dataset:
      rid: "<output dataset rid here>"
runtime:
  type: "rscript"
  rscript:
    identifier: "R.4.1.x"
    filePath: "project/src/gdp_plot.R"
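
If plotting fails partway through, the graphics device can be left open and the file incomplete. One possible safeguard, sketched below, wraps the device handling in a small helper that always closes the device and then checks the file before it is uploaded. The helper save_png is hypothetical, and the built-in mtcars data with base graphics stands in for the happiness dataset and ggplot2 so that the sketch runs anywhere.

```r
# Hypothetical helper: draw into a PNG file and always close the device.
save_png <- function(path, draw) {
  png(path)
  on.exit(dev.off(), add = TRUE)  # runs even if draw() raises an error
  draw()
  invisible(path)
}

# Skip on machines where R was built without PNG support.
if (capabilities("png")) {
  out <- save_png(file.path(tempdir(), "model_plot.png"), function() {
    # Stand-in data: scatter plot with a fitted linear regression line.
    plot(mtcars$wt, mtcars$mpg)
    abline(lm(mpg ~ wt, data = mtcars))
  })

  # Confirm the file was written before handing it to an upload call.
  stopifnot(file.exists(out), file.info(out)$size > 0)
}
```

With ggplot2, the draw function would call print() on the plot object, and the returned path is what would be passed to datasets.upload_files.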

Manage R packages and environments

By default, R transform repositories are configured with Artifacts Repositories that mirror Posit™ Package Manager and Bioconductor. We recommend using renv to install R packages and manage the R environment for your transforms.

In Code Workspace RStudio® Workbench, renv is installed and available by default. Users can search for available packages in the Packages panel and generate the corresponding R code to install them.

R transforms also support restoring from renv.lock. If an renv.lock file is present in the repository, you can run renv::restore() at the beginning of your transform or in .Rprofile to restore the R environment.
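
A minimal sketch of a guarded restore, suitable for the top of a transform script or for .Rprofile, is shown below. The helper restore_if_locked is hypothetical; it only calls renv when a lock file is actually present, and it passes prompt = FALSE on the assumption that no one is available to confirm the restore when the transform runs.

```r
# Hypothetical helper: restore the renv environment only when a lock file exists.
restore_if_locked <- function(project_dir = ".") {
  lockfile <- file.path(project_dir, "renv.lock")
  if (!file.exists(lockfile)) {
    message("No renv.lock found in ", project_dir, "; skipping restore.")
    return(invisible(FALSE))
  }
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("renv.lock is present but the renv package is not installed.")
  }
  # prompt = FALSE: do not wait for interactive confirmation.
  renv::restore(project = project_dir, prompt = FALSE)
  invisible(TRUE)
}

# Usage, from the repository root:
# restore_if_locked()
```

The function returns FALSE when there is nothing to restore, so a script can continue with the default library in that case.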

The following example shows R commands that install the packages needed for an R transform and generate an renv.lock file that records the state of the project library. The generated lock file can then be committed and used to restore the project later.

renv::install("dplyr")
renv::install("arrow")
renv::snapshot()

RStudio® is a trademark of Posit™.

All third-party trademarks (including logos and icons) referenced remain the property of their respective owners. No affiliation or endorsement is implied.

