API overview

The transforms-tables library provides TableInput and TableOutput parameters to interact with virtual tables in Python transforms. These behave in a similar way to the Input and Output parameters used with Foundry datasets. However, writing to a virtual table requires additional configuration to specify the source and the location within the external system where the table will be stored. The virtual table will be created during checks, as with dataset outputs. Once created, the extra configuration for the source and table metadata can be removed from the TableOutput and replaced with the table RID to be more concise.

from transforms.api import transform
from transforms.tables import TableInput, TableOutput, TableTransformInput, TableTransformOutput, SnowflakeTable


# Schematic only: each argument shows its name and type, not a real value.
@transform.spark.using(
    source_table=TableInput(alias: str),
    output_table=TableOutput(
        alias: str,  # The Compass path where the virtual table should be registered
        source: str,  # The source RID where the output table should be written
        table: Table,  # The locator defining where the output table should be stored in the external system
    ),
)
def compute(source_table: TableTransformInput, output_table: TableTransformOutput):
    ...  # Normal transforms API

The table: Table parameter in TableOutput defines where the output table should be created in the external system.
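To make the three TableOutput arguments concrete, here is a minimal sketch for a Snowflake output. It does not import transforms.tables: the SnowflakeTable dataclass below is a local stand-in that only mirrors the documented constructor, and the Compass path, source RID placeholder, and database names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnowflakeTable:
    """Local stand-in mirroring the documented SnowflakeTable(database, schema, table)."""

    database: str
    schema: str
    table: str


# Arguments in the shape the TableOutput signature above describes.
# Every value is a placeholder, not a real resource.
output_kwargs = {
    "alias": "/Example/project/orders_virtual",  # Compass path (hypothetical)
    "source": "<source-rid>",  # RID of the Snowflake source (placeholder)
    "table": SnowflakeTable(database="ANALYTICS", schema="PUBLIC", table="ORDERS"),
}

# Inside a Foundry repository, the real classes would be used instead, roughly:
#   output_table=TableOutput(**output_kwargs)
locator = output_kwargs["table"]
print(f"{locator.database}.{locator.schema}.{locator.table}")
```

Keeping the locator in one immutable object makes it easy to spot accidental edits to the database, schema, or table name during review.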

:::callout{theme="warning" title="Source and location cannot be changed after creation"} Once a virtual table has been created, it is not possible to change its source or location. Modifying the source or location will cause checks to fail. :::

The available Table subclasses are:

  • BigQueryTable(project: str, dataset: str, table: str)
  • DatabricksTable(catalog: str, schema: str, table: str, format: Optional[str], location: Optional[str])
  • DeltaTable(path: str)
  • FilesTable(path: str, format: str)
  • IcebergTable(table: str, warehouse_path: Optional[str])
  • SnowflakeTable(database: str, schema: str, table: str)

You must use the appropriate class based on the type of source you are connecting to. Refer to the documentation below for more information.

transforms.tables.BigQueryTable

Configures an output table in Google BigQuery.

Constructor:

BigQueryTable(project: str, dataset: str, table: str)

| Parameter | Type | Description | Optional |
|-----------|------|-------------|----------|
| project | str | The Google Cloud project ID where the BigQuery dataset resides. | No |
| dataset | str | The name of the BigQuery dataset containing the table. | No |
| table | str | The name of the BigQuery table. | No |

transforms.tables.DatabricksTable

Configures an output table in Databricks. Note that writing to tables in Databricks requires external access to be set up in Unity Catalog. Review the Databricks section of the virtual tables documentation for more information.

Constructor:

DatabricksTable(
    catalog: str,
    schema: str,
    table: str,
    format: Optional[str],
    location: Optional[str] = None
)

| Parameter | Type | Description | Optional |
|-----------|------|-------------|----------|
| catalog | str | The Databricks catalog name. | No |
| schema | str | The schema (database) name within the catalog. | No |
| table | str | The table name. | No |
| format | Optional[str] | The file format (delta, iceberg). Defaults to delta. | Yes |
| location | Optional[str] | The storage location for an external table (for example, abfss://<bucket-path>/<table-directory>). | Yes |

The following types of tables can be written to in Databricks:

  • External Delta table: The location parameter should specify the directory in cloud storage where the table should be stored. The format parameter defaults to delta, so it can be omitted. Refer to the official Databricks documentation for more information on external tables.
  • Managed Iceberg table: The format parameter should be set to iceberg. The location where the table will be stored is determined by Unity Catalog, so the location parameter is not required. Refer to the official Databricks documentation for more information on managed Iceberg tables.
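The two supported configurations differ only in format and location. The sketch below illustrates this with a local stand-in dataclass (not the transforms.tables class) and a hypothetical helper, table_kind, that classifies a locator according to the two cases listed above. The catalog, schema, and table names are made up, and the location reuses the placeholder from the parameter description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatabricksTable:
    """Local stand-in mirroring the documented constructor; not the library class."""

    catalog: str
    schema: str
    table: str
    format: Optional[str] = None  # assumption: None means the documented default, delta
    location: Optional[str] = None


def table_kind(t: DatabricksTable) -> str:
    """Hypothetical helper: name which documented table type a locator describes."""
    fmt = t.format or "delta"
    if fmt == "delta" and t.location is not None:
        return "external delta"
    if fmt == "iceberg" and t.location is None:
        return "managed iceberg"
    raise ValueError(f"not one of the two documented combinations: {t}")


# External Delta table: location is given, format is left at its default.
external = DatabricksTable(
    catalog="main",
    schema="sales",
    table="orders",
    location="abfss://<bucket-path>/<table-directory>",
)

# Managed Iceberg table: format is set, Unity Catalog decides the location.
managed = DatabricksTable(
    catalog="main", schema="sales", table="orders_iceberg", format="iceberg"
)

print(table_kind(external))
print(table_kind(managed))
```

Any other combination, such as a Delta table with no location, raises ValueError in this sketch because the source does not describe it as a supported case.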

transforms.tables.DeltaTable

Configures an output table in Delta Lake.

Constructor:

DeltaTable(path: str)

| Parameter | Type | Description | Optional |
|-----------|------|-------------|----------|
| path | str | The storage path to the Delta table. | No |

DeltaTable can be used with Azure Blob Filesystem, Google Cloud Storage, or Amazon S3 sources.

transforms.tables.FilesTable

Configures an output table stored in Avro, CSV, or Parquet format in a cloud storage location.

Constructor:

FilesTable(path: str, format: str)

| Parameter | Type | Description | Optional |
|-----------|------|-------------|----------|
| path | str | The path to the folder. | No |
| format | str | The file format (avro, csv, parquet). | No |

FilesTable can be used with Azure Blob Filesystem, Google Cloud Storage, or Amazon S3 sources.
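Because format accepts only three values, a typo is easy to catch before checks run. The following sketch uses a local stand-in for FilesTable whose validation is an illustrative assumption: the source does not say whether the real class validates format at construction time, and the path is a placeholder.

```python
from dataclasses import dataclass

SUPPORTED_FORMATS = ("avro", "csv", "parquet")  # the formats this section documents


@dataclass(frozen=True)
class FilesTable:
    """Local stand-in mirroring FilesTable(path, format); not the library class."""

    path: str
    format: str

    def __post_init__(self) -> None:
        # Illustrative check only; the real class may behave differently.
        if self.format not in SUPPORTED_FORMATS:
            raise ValueError(
                f"unsupported format {self.format!r}; expected one of {SUPPORTED_FORMATS}"
            )


# A supported format constructs normally.
events = FilesTable(path="<folder-path>/events", format="parquet")
print(events)

# An unsupported format is rejected by the stand-in's check.
try:
    FilesTable(path="<folder-path>/events", format="json")
except ValueError as err:
    print(err)
```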

transforms.tables.IcebergTable

Configures an output table in an Apache Iceberg catalog.

Constructor:

IcebergTable(table: str, warehouse_path: Optional[str] = None)

| Parameter | Type | Description | Optional |
|-----------|------|-------------|----------|
| table | str | The full table identifier (for example, db.table or catalog notation). | No |
| warehouse_path | Optional[str] | The warehouse storage path for the Iceberg table. | Yes |

Refer to the Iceberg catalogs section of this documentation for more information on supported sources.

transforms.tables.SnowflakeTable

Configures an output table in Snowflake.

Constructor:

SnowflakeTable(database: str, schema: str, table: str)

| Parameter | Type | Description | Optional |
|-----------|------|-------------|----------|
| database | str | The Snowflake database name. | No |
| schema | str | The schema name within the database. | No |
| table | str | The table name. | No |
