FAQ¶
Can I use PySpark in Code Workspaces?¶
Yes, you can use PySpark in Jupyter® Code Workspaces. However, it runs in local mode inside your workspace container and uses only the available CPUs and memory; it does not run on a distributed Spark cluster. We recommend Foundry applications like Pipeline Builder and Code Repositories to run distributed Spark workloads.
To install PySpark in your JupyterLab® Notebook, navigate to the Libraries tab and install both pyspark and openjdk. The following OpenJDK versions are supported:
- PySpark < 4.0.0: OpenJDK 8.x, 11.x, or 17.x
- PySpark 4.0.0 and above: OpenJDK 17.x or 21.x
:::callout{theme="neutral"}
If you encounter an error such as java.lang.UnsupportedOperationException: getSubject is supported only if a security manager is allowed, ensure that OpenJDK is pinned to a supported version for your PySpark release.
:::
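As a quick sanity check before pinning versions, the support list above can be written as a small lookup. This is only an illustrative sketch: the function names `supported_openjdk_majors` and `is_supported` are hypothetical and are not part of PySpark or Foundry.

```python
def supported_openjdk_majors(pyspark_version: str) -> set:
    """Return the OpenJDK major versions listed above for a PySpark release."""
    pyspark_major = int(pyspark_version.split(".")[0])
    # PySpark 4.0.0 and above: OpenJDK 17.x or 21.x; earlier: 8.x, 11.x, or 17.x.
    return {17, 21} if pyspark_major >= 4 else {8, 11, 17}


def is_supported(pyspark_version: str, openjdk_major: int) -> bool:
    """Check whether an OpenJDK major version is listed for this PySpark release."""
    return openjdk_major in supported_openjdk_majors(pyspark_version)


print(is_supported("3.5.1", 21))  # False
print(is_supported("4.0.0", 21))  # True
```

OpenJDK 17.x is the only major version that appears in both lists.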
Use the following code to create a Spark session, read raw parquet files into a Spark dataframe, and then apply transformations. This approach is useful for drafting PySpark transformations that you want to apply in a Code Repositories PySpark transform.
from foundry.transforms import Dataset
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
# Create the Spark context. The example below shows a 2 CPU, 16 GB workspace.
spark = (SparkSession.builder
    .appName("ParquetExample")
    .master("local[2]") # Set to the CPUs available.
    .config("spark.driver.memory", "10g") # Define the driver memory, around 50-65% of the container's total.
    .config("spark.driver.memoryOverhead", "4g") # Define the overhead memory, about 20-35% of the container's total.
    .config("spark.sql.shuffle.partitions", "6") # Optionally define partition count, ~3x the CPU total.
    .getOrCreate())
sc = spark.sparkContext
# Download a Foundry dataset of parquet files into the notebook.
parquet_foundry = Dataset.get("parquet_foundry_dataset")
files = parquet_foundry.files().download()
# Read the files into a Spark dataframe.
spark_df = spark.read.parquet(*list(files.values()))
# Apply PySpark transformations.
df_processed = spark_df.filter(F.col("col1") == 1)
# Write your output dataset.
spark_output = Dataset.get("spark_output")
spark_output.write_table(df_processed.toArrow())
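The sizing guidance in the comments above can be turned into a small helper for workspaces of other sizes. This is a rough sketch, not an official sizing rule: `suggest_spark_settings` is a hypothetical name, and the default fractions (62.5% for driver memory, 25% for overhead) are assumptions chosen from within the ranges quoted above so that a 2 CPU, 16 GB workspace reproduces the example values.

```python
def suggest_spark_settings(cpus: int, memory_gb: int,
                           driver_fraction: float = 0.625,
                           overhead_fraction: float = 0.25) -> dict:
    """Derive local-mode Spark settings from the workspace's CPUs and memory."""
    if cpus < 1 or memory_gb < 1:
        raise ValueError("cpus and memory_gb must be positive")
    # Round down to whole gigabytes, but never go below 1 GB.
    driver_gb = max(1, int(memory_gb * driver_fraction))
    overhead_gb = max(1, int(memory_gb * overhead_fraction))
    return {
        "master": f"local[{cpus}]",                       # One worker thread per CPU.
        "spark.driver.memory": f"{driver_gb}g",
        "spark.driver.memoryOverhead": f"{overhead_gb}g",
        "spark.sql.shuffle.partitions": str(3 * cpus),    # ~3x the CPU total.
    }


settings = suggest_spark_settings(cpus=2, memory_gb=16)
print(settings)
```

The `master` value can be passed to `.master(...)` and the remaining keys to `.config(key, value)` when building the session.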
Can I make API calls in Code Workspaces?¶
Yes. To make API calls from Code Workspaces, use external systems, which have replaced network policies as the preferred method of egress from a workspace. For CBAC-enabled environments, you should use network policies configured in Control Panel.
Which IDEs are supported by Code Workspaces?¶
Code Workspaces currently supports JupyterLab® and RStudio® Workbench.
Which packages are not supported by Code Workspaces?¶
For security reasons, the following Python packages are not supported:
- folium
- pandasgui
Contact your Palantir representative if you have any concerns about the packages above.
Why does the code in my repository have JSON formatting?¶
The Code Repositories application receives code from associated Code Workspaces in IPython notebook (.ipynb) format. Notebook files store their content cell by cell as JSON, so the code appears as JSON in the repository.
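To see why notebook code looks like JSON, it can help to read a notebook file as plain data. The sketch below uses only the Python standard library and a tiny hand-written notebook; the `code_cells` helper is a hypothetical name used for illustration, and real notebooks carry more metadata than shown here.

```python
import json

# A minimal notebook document: one markdown cell and one code cell.
notebook_json = """
{
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    {"cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [],
     "source": ["import math\\n", "print(math.pi)"]}
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5
}
"""


def code_cells(notebook: dict) -> list:
    """Return the source text of every code cell in a parsed notebook."""
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The source may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


cells = code_cells(json.loads(notebook_json))
print(cells[0])
```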
Can I use my own packages?¶
Yes; see the documentation on importing packages. If your package is hosted on an organizational Conda/PyPI/CRAN channel, it is possible for Foundry to proxy the channel and make it available to your projects. Contact your Palantir representative for more information.
Why does the Code Repository backing my Code Workspace not have a Libraries tab?¶
To import libraries into your Code Workspace, use the Packages tab located in the left panel of your workspace.
Can I write or edit code in Code Repositories that does not come from my Code Workspace?¶
Yes, you can edit code directly in Code Repositories when the code originates in Code Workspaces. Once committed, you can use the Sync or Reset changes functionality in the Code Workspace to pick up the remote changes in the Workspace.
Conceptually, you can think of Code Repositories as the version control manager for Code Workspaces, handling pull requests, conflict resolution, and administration, while code development can occur in Code Workspaces.
Why do my colleagues see a different view compared to mine?¶
For security purposes, users are isolated when working in JupyterLab® or RStudio®. This means each user accessing the same Code Workspace will have their own environment. Collaboration happens through git workflows: if you wish to make your latest code available to colleagues, select Sync Changes to synchronize your changes with the backing code repository and the changes will become available to your colleagues when they select Sync or Reset changes. When multiple users work on the same workspace, we recommend they work on independent branches.
Note that we ignore some files by default using .gitignore to ensure that no data is synchronized with the git repository, and to limit the size of the git repository. We also remove all outputs from JupyterLab® .ipynb files.
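To illustrate what removing outputs from an .ipynb file means, the sketch below clears the outputs and execution counts of code cells in a parsed notebook. It is not the mechanism Code Workspaces uses internally, which is not described here; `strip_outputs` is a hypothetical helper and the sample notebook is made up for the example.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with code cell outputs removed."""
    cleaned = copy.deepcopy(notebook)  # Leave the caller's notebook untouched.
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["42\n"]}],
            "source": ["print(42)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_outputs(notebook)
print(json.dumps(cleaned["cells"][0], indent=2))
```

Only the code itself remains in the cleaned cell, which is why colleagues who sync your changes need to rerun cells to see results.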
Can I import extra LaTeX packages to render my R Markdown file?¶
Yes. By default, Code Workspaces only pre-installs the requirements to render common R Markdown files (as defined in TinyTex-1), but you can install other LaTeX packages yourself:
- Download or create the TDS archive. For instance, you can download a .tds.zip file from CTAN.
- Upload the TDS archives to a new or existing Foundry dataset. You can drag and drop files in a folder to upload them to a new Foundry dataset.
- Select Upload to a dataset without a schema. You may upload multiple TDS archives simultaneously to the same dataset.
- In Code Workspaces, go to the Data tab, then Read existing datasets. Select the dataset containing the TDS archives.
- Optionally, update the dataset alias. In this example, we assume you named your dataset alias my_latex_packages.
- Copy the code snippet, paste it in your .Rprofile, and update it to also extract the files to the correct location. You will need to replace my_latex_packages with your dataset alias in the snippet below:
library(foundry)
my_latex_packages_files <- datasets.list_files("my_latex_packages")
my_latex_packages_local_files <- datasets.download_files("my_latex_packages", my_latex_packages_files)
texmflocal <- system("kpsewhich --var-value TEXMFLOCAL", intern = TRUE)
sapply(my_latex_packages_local_files, function(tds_file) { unzip(tds_file, exdir = texmflocal) })
- You can now include your package in the R Markdown header to use it:
header-includes:
  - \usepackage{my_package}
Can I use custom fonts in my R Markdown file?¶
Yes.
- Download or create the TTF files you want to use.
- Upload the TTF files to a new or existing Foundry dataset. You can drag and drop files in a folder to upload them to a new Foundry dataset.
- Select Upload to a dataset without a schema. You may upload multiple TTF files simultaneously to the same dataset.
- In Code Workspaces, go to the Data tab, then Read existing datasets. Select the dataset containing the TTF files.
- Optionally, update the dataset alias. In this example, we assume you named your dataset alias my_fonts.
- Copy the code snippet, run it in the R Console, and copy the files to your repo so they can be tracked by version control. You will need to replace my_fonts with your dataset alias in the snippet below:
library(foundry)
my_fonts_files <- datasets.list_files("my_fonts")
my_fonts_local_files <- datasets.download_files("my_fonts", my_fonts_files)
dir.create("fonts")
file.copy(unlist(my_fonts_local_files), "fonts")
- You can now reference the fonts in the /home/user/repo/fonts directory so they can be used in your HTML or PDF outputs, following the R Markdown recommendations.
To change the font for HTML, set the font-family in a custom CSS.
To change the font for PDFs, you can load the font in LaTeX with \newfontfamily using custom LaTeX code in the preamble.
How can I prevent my Jupyter® workspace from pausing while my code is running?¶
You can use the %%keep_alive cell magic to prevent Code Workspaces from pausing a Jupyter® notebook while the code in the cell is running.
%%keep_alive
long_running_process()
This will keep the cell alive for up to 24 hours while the cell code is running. If your browser tab is closed or left idle for too long, JupyterLab® will stop displaying and persisting the cell's output. If you would like to view the cell's output later, you may use the %%capture cell magic to store the output in a variable:
%%keep_alive
%%capture cell_output
long_running_process()
In a different cell, write the output of the above variable to a file so it can be viewed later:
with open('cell_output.txt', 'w+') as f:
    f.write(cell_output.stdout)
    f.write(cell_output.stderr)
How can I download files from my Jupyter or RStudio Code Workspace?¶
Native IDE file downloads are disabled in the Palantir platform to ensure that download restrictions configured by administrators are enforced.
In order to download files, write the files back to a Foundry dataset. To do so, in the Code Workspaces sidebar, navigate to the Data tab, then Write data to new dataset, and enter a name for the new dataset. After confirming the alias for the dataset, select Non-tabular dataset as dataset type, enter the location of the files or directory you wish to upload and run the generated code snippet.
Once the files have been uploaded to the Foundry dataset, you can navigate to the dataset preview by selecting the dataset link and download the files from there. Remember that your administrator may require a justification before you export data from the platform, or may limit the amount of data exported.
RStudio® and Shiny® are trademarks of Posit™.
Jupyter®, JupyterLab®, and the Jupyter® logos are trademarks or registered trademarks of NumFOCUS.
All third-party trademarks (including logos and icons) referenced remain the property of their respective owners. No affiliation or endorsement is implied.