Project structure¶
Default bootstrapped repository¶
Here is the standard structure for a bootstrapped Python transforms repository:
transforms-python
├── conda_recipe
│   └── meta.yaml
└── src
    ├── myproject
    │   ├── __init__.py
    │   ├── datasets
    │   │   ├── __init__.py
    │   │   └── examples.py
    │   └── pipeline.py
    ├── setup.cfg
    └── setup.py
There are also additional files inside the repository that can be viewed by going to the Settings cog in the File Explorer tab in Code Repositories and selecting Show hidden files and folders. In almost all cases, these hidden files should not be edited; Palantir does not provide support for repositories with custom changes to these hidden files.
You can learn more about the following files below:
- pipeline.py
- setup.py
- examples.py
- meta.yaml
:::callout{theme="neutral"} Make sure you go through the getting started guide before reading on. Also, this page assumes that you are using the default project structure that is included in a bootstrapped Python transforms repository. :::
Repository upgrade file changes¶
When you create the repository for the first time, it is bootstrapped with the default contents of the latest Python transforms template version at that time. During subsequent repository upgrades, files in the repository are upgraded to align with the contents of the most recent Python transforms template version. Custom changes to these files may be overwritten during the upgrade to keep the repository consistent with the template. We do not support custom changes to these files, as they can lead to unexpected behavior.
The following files will not be overwritten by a repository upgrade:
- Default files in the conda_recipe and src folders
- Inner and outer build.gradle files
The following files will be merged with the newest Python template file during a repository upgrade. In the case of any common keys, the Python template's version is chosen:
- gradle.properties
- versions.properties
The remaining files will be overwritten by the upgrade to match the files of the newest Python template versions.
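The merge rule for gradle.properties and versions.properties can be pictured with a small sketch. This is not the upgrade tooling itself: the helper names and the property keys below are hypothetical, and the parser only handles simple `key = value` lines.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def merge_with_template(repo_text, template_text):
    """Merge two property sets; on a shared key the template value wins."""
    merged = parse_properties(repo_text)
    # update() overwrites any key that both files define.
    merged.update(parse_properties(template_text))
    return merged


# Hypothetical file contents, for illustration only.
repo = "customKey = 1\nsharedKey = old\n"
template = "sharedKey = new\ntemplateKey = 2\n"
merged = merge_with_template(repo, template)
```

Under this reading, keys that exist only in your repository survive the upgrade, while any key that the template also defines takes the template's value.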
pipeline.py¶
In this file, you define your project’s Pipeline, which is a registry of the Transform objects associated with your data transformations. Here is the default src/myproject/pipeline.py file:
from transforms.api import Pipeline
from myproject import datasets
my_pipeline = Pipeline()
my_pipeline.discover_transforms(datasets)
Note that the default pipeline.py file uses automatic registration to add your Transform objects to your project’s Pipeline. Automatic registration discovers all Transform objects in your project’s datasets package. Thus, if you re-structure your project such that your transformation logic is not contained within the datasets folder, make sure to update your src/myproject/pipeline.py file appropriately.
Alternatively, you can explicitly add each of your Transform objects to your project’s Pipeline using manual registration. Unless you have a workflow that requires you to explicitly add each Transform object to your Pipeline, it’s recommended to use automatic registration. For more information about Pipeline objects, refer to the section describing Pipelines.
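To make the difference between the two registration styles concrete, here is a conceptual sketch that runs without the transforms library. The `Transform` and `Registry` classes are hypothetical stand-ins, not the real `transforms.api` classes, and this sketch scans a single in-memory module, whereas the description above says automatic registration covers the whole datasets package.

```python
import types


class Transform:
    """Hypothetical stand-in for a transform object; not the transforms.api class."""

    def __init__(self, name):
        self.name = name


class Registry:
    """Hypothetical stand-in for a Pipeline: it only keeps a list of transforms."""

    def __init__(self):
        self.transforms = []

    def add(self, *transforms):
        # Manual registration: the caller names every transform explicitly.
        self.transforms.extend(transforms)

    def discover(self, module):
        # Automatic registration: scan the module's attributes and keep
        # every object that is a Transform instance.
        for value in vars(module).values():
            if isinstance(value, Transform):
                self.transforms.append(value)


# Build an in-memory module so the sketch runs without any files on disk.
datasets = types.ModuleType("datasets")
datasets.clean = Transform("clean")
datasets.join = Transform("join")
datasets.helper = lambda x: x  # not a Transform, so discovery skips it

auto = Registry()
auto.discover(datasets)

manual = Registry()
manual.add(datasets.clean)

discovered_names = sorted(t.name for t in auto.transforms)
manual_names = [t.name for t in manual.transforms]
```

With discovery, adding a new transform to the scanned module is enough for it to be registered; with manual registration, anything you forget to list is simply absent from the registry.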
setup.py¶
In this file, you define a transforms.pipelines entry point named root that is associated with your project’s Pipeline. This entry point allows Python transforms to discover your project’s Pipeline. Here is the default src/setup.py file:
import os
from setuptools import find_packages, setup

setup(
    name=os.environ['PKG_NAME'],
    version=os.environ['PKG_VERSION'],

    description='Python data transformation project',

    # Modify the author for this project
    author='{{REPOSITORY_ORG_NAME}}',

    packages=find_packages(exclude=['contrib', 'docs', 'test']),

    # Instead, specify your dependencies in conda_recipe/meta.yml
    install_requires=[],

    entry_points={
        'transforms.pipelines': [
            'root = myproject.pipeline:my_pipeline'
        ]
    }
)
If you modify the default project structure, you may need to modify the content in your src/setup.py file. For more information, refer to the section describing the transforms.pipelines entry point.
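The entry point string follows the standard Python packaging form `name = module:attribute`. As a rough illustration of how such a string breaks down, the sketch below uses the standard library's `importlib.metadata` (assuming Python 3.9 or later). It shows the generic entry point mechanism only; how Python transforms itself resolves the entry point is not covered here.

```python
from importlib.metadata import EntryPoint

# The same values that appear in the default setup.py above.
ep = EntryPoint(
    name="root",
    value="myproject.pipeline:my_pipeline",
    group="transforms.pipelines",
)

# The part before the colon is the module to import,
# and the part after it is the attribute to read from that module.
module_path = ep.module
attribute = ep.attr
```

If you move pipeline.py or rename my_pipeline, these two parts are what need to change in the entry point string. Calling `ep.load()` would import the module and return the attribute, which only works once the package is installed.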
examples.py¶
This file contains your data transformation code. Here is an uncommented version of the default src/myproject/datasets/examples.py file:
from transforms.api import Input, Output, transform, LightweightInput, LightweightOutput


@transform.using(
    output_dataset=Output("/path/to/output/dataset"),
    input_dataset=Input("/path/to/input/dataset"),
)
def compute(input_dataset: LightweightInput, output_dataset: LightweightOutput) -> None:
    output_dataset.write_table(input_dataset.polars(lazy=True))
After uncommenting the sample code, you can replace /path/to/input/dataset and /path/to/output/dataset with the full paths to your input and output datasets, respectively. If your data transformation relies on multiple datasets, you can provide additional input datasets. You can update the compute function to contain the code that transforms your input dataset(s) into your output dataset. Also, keep in mind that a single Python file supports the creation of multiple output datasets.
Note that the sample code uses the Polars compute engine. You can also modify the code to use other compute options. See Compute engines for more details on compute options, and Getting started for code examples across all engines.
For more information about creating Transform objects, which describe your input and output datasets as well as your transformation logic, refer to the section describing Transforms.
meta.yaml¶
A conda build recipe is a directory containing all the metadata and scripts required to build a conda package. One of the files in the build recipe is meta.yaml, which contains all the metadata. For more information about the structure of this file, refer to the conda documentation on the meta.yaml file.
Here is the default conda_recipe/meta.yaml file:
# If you need to modify the runtime requirements for your package,
# update the 'requirements.run' section in this file

package:
  name: "{{ PACKAGE_NAME }}"
  version: "{{ PACKAGE_VERSION }}"

source:
  path: ../src

requirements:
  # Tools required to build the package. These packages are run on the build system and include
  # things such as revision control systems (Git, SVN) make tools (GNU make, Autotool, CMake) and
  # compilers (real cross, pseudo-cross, or native when not cross-compiling), and any source pre-processors.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#build
  build:
    - python 3.9.*
    - setuptools

  # Packages required to run the package. These are the dependencies that are installed automatically
  # whenever the package is installed.
  # https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#run
  run:
    - python 3.9.*
    - transforms {{ PYTHON_TRANSFORMS_VERSION }}
    - transforms-expectations
    - transforms-verbs

build:
  script: python setup.py install --single-version-externally-managed --record=record.txt
If your Python transforms project requires any additional build dependencies, you can use the package tab to discover available packages and automatically add them to your meta.yaml file, as described in the documentation on sharing Python libraries. This step automatically detects the channel that produces the package you are trying to import and adds it as a backing repository.
It is also possible to manually update the "requirements" section in this file. However, manual edits are strongly discouraged: you risk requesting packages and versions that are not available, which will cause Checks to fail on your repository. For any dependencies that you add, make sure that the packages they require are also available.
Note that it is unlikely you will need to modify sections other than "requirements".
Supported Python 3 versions¶
Palantir supports active versions of Python, adhering to the Python Software Foundation's end-of-life schedule. Refer to the Python version support page for more details.
Example usage:
requirements:
  build:
    - python 3.9.*
    - setuptools

  # Any extra packages required to run your package.
  run:
    - python 3.9.*
    - transforms {{ PYTHON_TRANSFORMS_VERSION }}
:::callout{theme="neutral"}
* Be sure that the Python dependencies in the build and run sections are identical. Mismatches between the Python dependencies can lead to undesired outcomes and failures.
* Ranges such as python >=3.9 or python >3.9,<=3.10.11 are not supported for Python versions.
:::
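The two rules in the callout can be checked mechanically before you commit. The following is a minimal sketch, not a Foundry tool: `check_python_pins` is a hypothetical helper that takes the `requirements` section already parsed into a dictionary, and it assumes a Python entry is acceptable only when it is a plain version or a wildcard pin such as `python 3.9.*`.

```python
import re

# Matches "python 3.9.*" or "python 3.9.7"; entries that use range
# operators such as ">=" or "<" do not match and are treated as unsupported.
PYTHON_PIN = re.compile(r"^python\s+(\d+(?:\.\d+)*(?:\.\*)?)$")


def python_pin(section):
    """Return the Python pin found in a requirements section, or None."""
    for entry in section:
        match = PYTHON_PIN.match(entry.strip())
        if match:
            return match.group(1)
    return None


def check_python_pins(requirements):
    """Return a list of problems with the build and run Python pins."""
    build = python_pin(requirements.get("build", []))
    run = python_pin(requirements.get("run", []))
    problems = []
    if build is None:
        problems.append("build: missing or unsupported Python pin")
    if run is None:
        problems.append("run: missing or unsupported Python pin")
    if build and run and build != run:
        problems.append(f"build ({build}) and run ({run}) differ")
    return problems


ok = check_python_pins({
    "build": ["python 3.9.*", "setuptools"],
    "run": ["python 3.9.*", "transforms"],
})
mismatch = check_python_pins({
    "build": ["python 3.9.*"],
    "run": ["python 3.10.*"],
})
ranged = check_python_pins({
    "build": ["python 3.9.*"],
    "run": ["python >=3.9"],
})
```

Here `ok` is an empty list, `mismatch` reports that the two sections differ, and `ranged` flags the run section because a range is not a supported pin.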
Pinning run-time versions¶
If your transforms require a specific library version to be pinned, and you want to add the pin manually rather than using the recommended package tab, you can specify the version alongside the library name in the requirements block. Below is an example of pinning:
requirements:
  run:
    # The below pins an explicit version
    - mylibrary 1.0.1

    # The below specifies a maximum version (version equal or lower):
    - scipy <=1.4.0
Note:
- No space is allowed after the operator. For example, scipy <= 1.4.0 will fail CI checks.
- The operator >= for versions is not yet supported in Foundry.
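As a quick way to catch the two pitfalls in the note above, the sketch below flags a requirement string that uses `>=` or that puts whitespace after an operator. `spec_problems` is a hypothetical helper written for this page; it checks only these two documented rules and makes no claim about which other operators are accepted.

```python
import re


def spec_problems(spec):
    """Return the documented problems found in one requirement string."""
    problems = []
    if ">=" in spec:
        problems.append("'>=' is not supported")
    # An operator character followed by whitespace, as in "<= 1.4.0".
    if re.search(r"[<>=!]\s+\S", spec):
        problems.append("space after operator")
    return problems


pinned = spec_problems("mylibrary 1.0.1")
maximum = spec_problems("scipy <=1.4.0")
spaced = spec_problems("scipy <= 1.4.0")
minimum = spec_problems("scipy >=1.4.0")
```

The first two strings pass with no problems, while `spaced` and `minimum` each report the rule they break.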
Using pip-managed dependencies¶
If your transform requires a specific library that is not available through Conda, but is available when installed using pip, you can declare it in the additional pip section. The dependency will be installed on top of your Conda environment. Below is an example of adding a pip dependency:
requirements:
  build:
    - python 3.9.*
    - setuptools

  run:
    - python 3.9.*
    - transforms {{ PYTHON_TRANSFORMS_VERSION }}

  pip:
    - pypicloud
Note:
- Dependencies added to the pip section are installed on top of the Conda environment that is derived from the packages in the run section. Therefore, removing run or build would cause failures.
- The pip section can only be used in Python transforms repositories, and cannot be used in Python libraries.