Google Cloud Storage¶

Connect Foundry to Google Cloud Storage to sync files between Foundry datasets and storage buckets.

Supported capabilities¶

Capability	Status
Exploration	🟢 Generally available
Bulk import	🟢 Generally available
Incremental	🟢 Generally available
Virtual tables	🟢 Generally available
Export tasks	🟡 Sunset
File exports	🟢 Generally available

Data model¶

The connector can transfer files of any type into Foundry datasets. File formats are preserved, and no schemas are applied during or after the transfer. Apply any necessary schema to the output dataset, or write a downstream transformation to access the data.

Performance and limitations¶

There is no limit to the size of transferable files. However, network issues can cause large transfers to fail. In particular, direct cloud syncs that run for more than two days will be interrupted. To reduce the risk of such failures, we recommend using smaller files and limiting the number of files ingested in each execution of the sync. Syncs can be scheduled to run frequently.

Setup¶

Open the Data Connection application and select + New Source in the upper right corner of the screen.
Select Google Cloud Storage from the available connector types.
Follow the additional configuration prompts to continue setting up your connector, using the information in the sections below.

Learn more about setting up a connector in Foundry.

:::callout{theme="warning"} You must have a Google Cloud IAM service account ↗ to proceed with Google Cloud Storage authentication and setup. :::

Authentication¶

The following roles are required on the bucket being accessed:

Storage Object Viewer: Read data.
Storage Object Creator: Export data to Google Cloud Storage.
Storage Object Admin: Required for deleting files from Google Cloud Storage after importing them into Foundry, and also for exports with incremental datasets that use UPDATE transactions and overwrite files.

Learn more about required roles in the Google Cloud documentation on access control ↗.
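As a rough illustration of the role requirements above, the following Python sketch maps the operations you intend to run to the roles listed. The function name and its flags are hypothetical and are not part of Foundry or Google Cloud. The role IDs are the standard IAM identifiers for the three named roles.

```python
# Hypothetical helper: maps connector operations to the bucket roles listed above.
ROLE_IDS = {
    "Storage Object Viewer": "roles/storage.objectViewer",
    "Storage Object Creator": "roles/storage.objectCreator",
    "Storage Object Admin": "roles/storage.objectAdmin",
}


def required_roles(read=False, export=False, delete_after_import=False,
                   overwrite_on_export=False):
    """Return the display names of the roles needed for the requested operations."""
    roles = []
    if read:
        roles.append("Storage Object Viewer")
    if export:
        roles.append("Storage Object Creator")
    # Deleting imported files, or overwriting files during incremental
    # exports with UPDATE transactions, both call for the Admin role.
    if delete_after_import or overwrite_on_export:
        roles.append("Storage Object Admin")
    return roles


# Example: a sync that reads files and deletes them after import.
for name in required_roles(read=True, delete_after_import=True):
    print(name, "->", ROLE_IDS[name])
```

A checklist like this can help when requesting access from a Google Cloud administrator, since it states only the roles that the planned workflow needs.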

Choose one of the available authentication methods:

GCP instance: Refer to the Google Cloud documentation ↗ for information on how to set up instance-based authentication.
Note that GCP instance authentication only works for connectors operating through agents that run on appropriately configured instances in GCP.
JSON credentials: Refer to the Google Cloud documentation ↗ for information on how to create and download a JSON service account key file.
PKCS8 auth: Requires entering specific credential information from the JSON service account key file. Refer to the Google Cloud documentation ↗ for information on creating the key file.
Workload Identity Federation (OIDC): Follow the displayed source system configuration instructions to set up OIDC. Refer to the Google Cloud documentation ↗ for details on Workload Identity Federation, and to our documentation for details on how OIDC works with Foundry.

Networking¶

The Google Cloud Storage connector requires network access to the following domains on port 443:

storage.googleapis.com
oauth2.googleapis.com (only required when using JSON credentials or PKCS8 auth)
sts.googleapis.com (only required when using Workload Identity Federation)
iamcredentials.googleapis.com (only required when using Workload Identity Federation)
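The domain list above can be expressed as a small lookup, which is convenient when preparing an allowlist request. This is a minimal sketch: the authentication method keys and function names are hypothetical labels chosen for this example, and the empty entry for GCP instance authentication only reflects that the list above names no additional domain for it. The reachability check uses a plain TCP connection from the Python standard library and is only meaningful when run from the host that actually makes the connection.

```python
import socket

# Domains from the list above. The dictionary keys are hypothetical labels.
BASE_DOMAINS = ["storage.googleapis.com"]
AUTH_DOMAINS = {
    "gcp-instance": [],  # no additional domain is listed above for this method
    "json-credentials": ["oauth2.googleapis.com"],
    "pkcs8": ["oauth2.googleapis.com"],
    "workload-identity-federation": [
        "sts.googleapis.com",
        "iamcredentials.googleapis.com",
    ],
}


def required_domains(auth_method):
    """Return every domain that must be reachable on port 443 for a method."""
    return BASE_DOMAINS + AUTH_DOMAINS[auth_method]


def is_reachable(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `is_reachable(host)` for each entry of `required_domains("json-credentials")` gives a quick first check of egress rules. It does not validate credentials or proxy configuration.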

Configuration options¶

The following configuration options are available for the Google Cloud Storage connector:

Option	Required?	Description
`Project Id`	Yes	The ID of the Google Cloud project containing the Cloud Storage bucket.
`Bucket name`	Yes	The name of the bucket to read data from and write data to.
`Credentials settings`	Yes	Configure using the Authentication guidance shown above.
`Proxy settings`	No	Enable to use a proxy while connecting to Google Cloud Storage.

Sync data from Google Cloud Storage¶

The Google Cloud Storage connector uses the file-based sync interface. See documentation on configuring file-based syncs.

Virtual tables¶

This section provides additional details on using virtual tables from a Google Cloud Storage source. This section is not applicable when syncing to Foundry datasets.

The table below highlights the virtual table capabilities that are supported for Google Cloud Storage.

Capability	Status
Bulk registration	🔴 Not available
Automatic registration	🔴 Not available
Table inputs	🟢 Generally available: Avro ↗, Delta ↗, Iceberg ↗, Parquet ↗ in Code Repositories, Pipeline Builder
Table outputs	🟢 Generally available: Avro ↗, Delta ↗, Iceberg ↗, Parquet ↗ in Code Repositories, Pipeline Builder
Incremental pipelines	🟢 Generally available for Delta tables: `APPEND` only (details); 🟢 Generally available for Iceberg tables: `APPEND` only (details); 🔴 Not available for Parquet tables
Compute pushdown	🔴 Not available

Consult the virtual tables documentation for details on the supported Foundry workflows where tables stored in Google Cloud Storage can be used as inputs or outputs.

Source configuration requirements¶

When using virtual tables, remember the following source configuration requirements:

You must use a Foundry worker source. Virtual tables do not support use of agent worker connections.
Ensure that bi-directional connectivity and allowlisting are established as described in the Networking section of this documentation.
If using virtual tables in Code Repositories, refer to the virtual tables documentation for details of additional source configuration required.
When setting up the source credentials, you must use one of JSON credentials, PKCS8 auth or Workload Identity Federation (OIDC). Other credential options are not supported when using virtual tables.

Delta¶

To enable incremental support for pipelines backed by virtual tables, ensure that Change Data Feed ↗ is enabled on the source Delta table. The `current` and `added` read modes in Python Transforms are supported. The `_change_type`, `_commit_version`, and `_commit_timestamp` columns will be made available in Python Transforms.
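To make the Change Data Feed columns more concrete, the sketch below works on plain Python dictionaries rather than on Spark or the Python Transforms API. In Delta Lake, `_change_type` takes the values `insert`, `update_preimage`, `update_postimage`, and `delete`. The function name and sample rows are hypothetical, and the sketch only illustrates how a consumer might tell appended rows apart from other changes; it does not describe how Foundry processes them.

```python
def split_changes(rows):
    """Separate inserted rows from updates and deletes, using _change_type."""
    appended = [r for r in rows if r["_change_type"] == "insert"]
    other = [r for r in rows if r["_change_type"] != "insert"]
    return appended, other


# Hypothetical change feed rows for illustration only.
rows = [
    {"id": 1, "_change_type": "insert", "_commit_version": 3},
    {"id": 2, "_change_type": "insert", "_commit_version": 3},
    {"id": 1, "_change_type": "update_preimage", "_commit_version": 4},
    {"id": 1, "_change_type": "update_postimage", "_commit_version": 4},
]

appended, other = split_changes(rows)
print([r["id"] for r in appended])                     # [1, 2]
print(sorted({r["_commit_version"] for r in other}))   # [4]
```

Because incremental support is `APPEND` only, a check of this kind can flag commits that contain updates or deletes before they are treated as new rows.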

Iceberg¶

An Iceberg catalog is required to load virtual tables backed by an Apache Iceberg table. To learn more about Iceberg catalogs, see the Apache Iceberg documentation ↗. All Iceberg tables registered on a source must use the same Iceberg catalog.

Tables will be created using Iceberg metadata files in Google Cloud Storage (GCS). A `warehousePath` indicating the location of these metadata files must be provided when registering a table.

Incremental support relies on Iceberg Incremental Reads ↗ and is currently append-only. The current and added read modes in Python Transforms are supported.

Parquet¶

Virtual tables using Parquet rely on schema inference. At most 100 files will be used to determine the schema.

Export data to Google Cloud Storage¶

The connector can copy files from a Foundry dataset to any location in the Google Cloud Storage bucket.

To begin exporting data, you must configure an export task. Navigate to the Project folder that contains the Google Cloud Storage connector to which you want to export. Right-click the connector name, then select Create Data Connection Task.

In the left panel of the Data Connection view:

Verify the Source name matches the connector you want to use.
Add an Input named inputDataset. The input dataset is the Foundry dataset being exported.
Add an Output named outputDataset. The output dataset is used to run, schedule, and monitor the task.
Finally, add a YAML block in the text field to define the task configuration.

:::callout{theme="neutral"} The labels for the connector and input dataset that appear in the left side panel do not reflect the names defined in the YAML. :::

Use the following options when creating the export task YAML:

Option	Required?	Description
`directoryPath`	Yes	The directory in Cloud Storage where files will be written.
`excludePaths`	No	A list of regular expressions; files with names matching these expressions will not be exported.
`uploadConfirmation`	No	When the value is `exportedFiles`, the output dataset will contain a list of files that were exported.
`retriesPerFile`	No	If you experience network failures, increase this number to allow the export job to retry uploads to Cloud Storage before failing the entire job.
`createTransactionFolders`	No	When enabled, data will be written to a subfolder within the specified `directoryPath`. Every subfolder is based on the time the transaction was committed in Foundry and has a unique name for every exported transaction.
`threads`	No	Set the number of threads used to upload files in parallel. Increase the number to use more resources. Ensure that exports running on agents have enough resources on the agent to handle increased parallelization.
`incrementalType`	No	For datasets that are built incrementally, set to `incremental` to only export transactions that occurred since the previous export.
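The option table can be turned into a simple pre-flight check. The following sketch assumes the task configuration has already been parsed into a Python dictionary; the function name is hypothetical, and the check covers only option names, not value types, which the table above does not specify.

```python
# Option names taken from the table above. "type" identifies the task kind
# in the task configuration and is accepted here as a known key.
REQUIRED_OPTIONS = {"directoryPath"}
OPTIONAL_OPTIONS = {
    "excludePaths",
    "uploadConfirmation",
    "retriesPerFile",
    "createTransactionFolders",
    "threads",
    "incrementalType",
}
KNOWN_KEYS = REQUIRED_OPTIONS | OPTIONAL_OPTIONS | {"type"}


def check_export_options(config):
    """Return a list of problems found in a dict of export task options."""
    keys = set(config)
    problems = []
    for name in sorted(REQUIRED_OPTIONS - keys):
        problems.append(f"missing required option: {name}")
    for name in sorted(keys - KNOWN_KEYS):
        problems.append(f"unknown option: {name}")
    return problems


print(check_export_options({"type": "export-google-cloud-storage",
                            "directorypath": "exports/"}))
# ['missing required option: directoryPath', 'unknown option: directorypath']
```

Running a check like this before saving the task catches misspelled option names, such as the lowercase `directorypath` in the example.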

Example task configuration:

```yaml
type: export-google-cloud-storage
directoryPath: directory/to/export/to
excludePaths:
  # Skip paths that begin with an underscore.
  - ^_.*
  # Skip paths that begin with spark/_.
  - ^spark/_.*
uploadConfirmation: exportedFiles
incrementalType: incremental
retriesPerFile: 0
createTransactionFolders: true
threads: 0
```
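To see which files the `excludePaths` patterns from the example configuration would filter out, the sketch below applies them with Python's `re` module. This is an approximation under stated assumptions: it assumes each pattern is tested against the file path relative to the dataset root, and the connector's own regular expression engine and matching rules may differ. The function name and file paths are hypothetical.

```python
import re


def files_to_export(paths, exclude_patterns):
    """Return the paths that match none of the exclude patterns."""
    compiled = [re.compile(p) for p in exclude_patterns]
    return [p for p in paths if not any(rx.search(p) for rx in compiled)]


# Hypothetical dataset file paths for illustration only.
paths = [
    "_SUCCESS",
    "spark/_committed_1",
    "data/part-0000.parquet",
    "spark/part-0001.parquet",
]

print(files_to_export(paths, [r"^_.*", r"^spark/_.*"]))
# ['data/part-0000.parquet', 'spark/part-0001.parquet']
```

Both patterns are anchored with `^`, so only paths that begin with `_` or `spark/_` are excluded; `spark/part-0001.parquet` is kept.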

After you configure the export task, select Save in the upper right corner.
