Snowflake¶
Connect Foundry to Snowflake to read and sync data between Snowflake and Foundry.
Supported capabilities¶
| Capability | Status |
|---|---|
| Exploration | 🟢 Generally available |
| Bulk import | 🟢 Generally available |
| Incremental | 🟢 Generally available |
| Virtual tables | 🟢 Generally available |
| Compute pushdown | 🟢 Generally available |
| Export tasks | 🟡 Sunset |
| Table Exports | 🟢 Generally available |
Setup¶
- Open the Data Connection application and select + New Source in the upper right corner of the screen.
- Select Snowflake from the available connector types.
- Follow the additional configuration prompts to continue the setup of your connector using the information in the sections below.
Learn more about setting up a connector in Foundry.
Connection details¶
:::callout{theme="warning"}
Snowflake accounts with underscores (_) must replace underscores with dashes (-). For example, my_account_prod needs to become my-account-prod. Failure to do so will cause networking issues.
:::
| Option | Required? | Description |
|---|---|---|
Account identifier |
Yes | This is the identifier that precedes ".snowflakecomputing.com". See Snowflake's official documentation ↗ for more details. |
Roles |
No | This is the default role to be used by the connection in case the credentials provided have access to multiple roles. |
Database |
Yes | Specify a default database to use once connected. |
Schema |
No | Option to specify a default schema to use once connected. If not specified, all schemas will be available that are in-scope of the credentials. |
Warehouse |
No* | The virtual warehouse to use once connected. In the case of registered virtual tables, this will be used for any source-side compute. |
Credentials |
Yes | Refer to the authentication section below for more details. |
Network Connectivity |
Yes | Refer to the networking section below for more details. |
* Warehouse details are optional for syncing Foundry datasets, but required for registering virtual tables.
Authentication¶
You can authenticate with Snowflake in the following ways:
| Method | Description | Documentation |
|---|---|---|
| Username and password [Legacy] | Authenticate with a user account using a username and password. Basic authentication is legacy and not recommended in production. | Working with passwords ↗ |
| Key-pair authentication | Provide a username and private key. Note that only unencrypted private keys are supported. Foundry will encrypt and store the private key securely. | Key-pair authentication and key-pair rotation ↗ |
| External OAuth (OIDC) [Recommended] | Authenticate as a user using workload identity federation. Workload identity federation allows workloads running in Foundry to access Snowflake without the need for Snowflake secrets. Follow the displayed source system configuration instructions to set up external OAuth. | Workload identity federation ↗ Refer to our OIDC documentation for an overview of how OpenID Connect (OIDC) is supported in Foundry. |
| Programmatic access token | Authenticate as a user using a programmatic access token (PAT). | Programmatic access tokens ↗ |
For all authentication options, ensure that the provided user and role has usage privileges on the target database(s) and schema(s), as well as select privileges on the target table(s).
When registering virtual tables, the user and their role should also have usage privileges on the warehouse.
:::callout{theme="warning"} Snowflake is rolling out changes to require multi-factor authentication (MFA) for human users that use passwords, and to disallow passwords for all service users. As such, Username and password will no longer be a suitable authentication mechanism. Refer to the official Snowflake documentation ↗ for additional information and guidance on migrating. :::
Networking¶
For connections running on a Foundry worker, the appropriate egress policies must be added when setting up the source in the Data Connection application.
To identify the hostnames and port numbers of your Snowflake account to be allowlisted, you can run the following command in your Snowflake console. Ensure that at least the entries for SNOWFLAKE_DEPLOYMENT and STAGE are added as egress policies in Foundry.
SELECT t.VALUE:type::VARCHAR as type,
t.VALUE:host::VARCHAR as host,
t.VALUE:port as port
FROM TABLE(FLATTEN(input => PARSE_JSON(SYSTEM$ALLOWLIST()))) AS t;
See Snowflake's official documentation ↗ for additional information on identifying hostnames and port numbers to allowlist.
:::callout{theme="neutral"} Connections from Foundry to Snowflake normally come from the default public gateway IPs for your environment. However, traffic within the same cloud provider (for example, AWS-AWS or Azure-Azure) may use different routing, and require establishing a connection via PrivateLink. See below for the additional setup required per cloud provider, or contact your Palantir representative for additional guidance. :::
Snowflake instance hosted on S3¶
If your Snowflake instance is configured to route internal S3 stage traffic through a VPCE ↗, the Snowflake JDBC driver must be manually configured to not use the custom VPCE domain. Otherwise, the driver will be routed to the custom VPCE domain (which is inaccessible from Foundry's VPC) and will fail connections to URLs with the format of <bucketname>.bucket.vpce-<vpceid>.s3.<region>.vpce.amazonaws.com.
You can manually configure this by adding a JDBC connection property in the Connection details of your instance, with a key of S3_STAGE_VPCE_DNS_NAME and an empty value field (the equivalent of setting it to null).
The S3 stage traffic will then be routed through the AWS S3 Gateway Endpoint (<bucketname>.bucket.s3.<region>.vpce.amazonaws.com) which maintains private connectivity so traffic will not be routed through the public internet.
Review our PrivateLink egress documentation ↗ for more information.
:::callout{theme="neutral"} For egress policies that depend on an S3 bucket in the same region as your Foundry instance, ensure you have completed the additional configuration steps detailed in our Amazon S3 bucket policy documentation for the affected bucket(s). :::
Snowflake instance hosted on Azure¶
The Snowflake JDBC driver used for the Foundry Snowflake connector may attempt to connect directly to an underlying “internal stage” storage bucket when fetching data. For Snowflake hosted on Azure, because Azure-hosted Foundry enrollments route traffic over Azure service endpoints, network connectivity from Foundry to the underlying stage buckets must be explicitly allow-listed by following the instructions below.
Gather the required information about your Snowflake warehouse¶
You will need the following information about your Azure-hosted Snowflake warehouse to establish network connectivity to Foundry:
- Full list of system allowlist domains.
- Your Azure storage account identifier, obtained from Snowflake.
System allowlist domains¶
Use the SYSTEM$ALLOWLIST command to get the full list of domains that may be required to successfully connect.
- Note: This is the same command than used above to define egress policies, and is explained in the network panel callout in Data Connection.
- This list will include the domain of an Azure storage bucket used as the stage for your Snowflake warehouse.
Azure storage account identifier¶
For the Azure storage bucket returned from the SYSTEM$ALLOWLIST command, you will also need to retrieve the storage account identifier.
- If you are using Snowflake Standard Edition or Enterprise Edition, you will need to file a ticket with Snowflake support to request the storage account identifier.
- If you are using Snowflake Business Critical Edition ↗, you can retrieve the storage account identifier with the following steps:
- Set the
ENABLE_INTERNAL_STAGES_PRIVATELINK↗ parameter toTRUEfor the account. - Then, call the
SYSTEM$GET_PRIVATELINK_CONFIG()↗ function, which returns a field calledprivatelink-internal-stagecontaining the Azure storage account resource identifier.- Note that even if you are not connecting over a PrivateLink, you still need to retrieve and provide the storage account resource identifier.
A full Azure Storage account resource identifier will be in the following format:
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Storage/storageAccounts/{storageAccountName}
More information on how to find an Azure Storage account resource ID directly in the Azure console can be found in the Azure documentation ↗. Restricting cross-account network traffic using VNET rules and the storage account identifier is in line with Microsoft’s published best practices ↗, and should be used for all connections to Azure-hosted Snowflake warehouses from within Azure compute instances.
Allow outbound traffic from Foundry to the Azure storage account associated with your Snowflake warehouse¶
Now that you have gathered the required information about your Snowflake warehouse, you can create the required policies needed to enable Foundry access to your Snowflake data.
- Create a standard egress policy for the Azure storage internal stage, and attach it to your Snowflake source.
-
Note that you should add policies for everything returned from the
SYSTEM$ALLOWLISTcommand, and not just the storage bucket domain. -
Create an Azure storage policy, pasting in the storage account resource identifier.
Navigate back to your Snowflake source in Data Connection and confirm you can explore the source and run syncs.
Iceberg tables (virtual tables only)¶
The Virtual tables section of this documentation provides details on integrating Iceberg tables registered in Snowflake Horizon Catalog. This functionality requires network connectivity to the external volume where the Iceberg table is stored. You must create network egress policies for each external volume, as Foundry reads and writes to the storage location directly rather than querying the table using a Snowflake compute warehouse.
Refer to the official Snowflake documentation ↗ for more information on external volumes and how to determine table storage locations.
:::callout{theme="neutral"}
Configuring egress policies for an external volume enables network traffic to egress from Foundry to that storage location. Network controls may vary between cloud providers, so you should ensure that any network controls on the storage location permit network traffic from Foundry. Learn more about identifying the IP addresses where Foundry traffic originates..
Additionally, refer to the other sections of this Networking documentation to ensure you correctly set up connections to external volumes hosted in the same cloud region as your Foundry instance.
:::
Virtual tables¶
This section provides additional details around using virtual tables with a Snowflake source. This section is not applicable when syncing to Foundry datasets.
:::callout{theme="neutral"}
The Snowflake connector now offers enhanced functionality when using virtual tables to access Iceberg tables registered in Horizon Catalog. Foundry uses the Iceberg REST APIs exposed in Horizon Catalog to access tables as well as read and write data in the underlying storage locations, which are configured as external volumes in Snowflake. Additionally, Foundry uses Horizon Catalog credential vending to ensure secure access to cloud object storage. This can also improve the performance of reads and writes against these tables.
The connector exposes Iceberg functionality automatically if you:
- Configure network egress policies that allow connectivity from Foundry to the external volume that stores the table.
- Configure credentials on the source that have permission to obtain vended credentials from Horizon Catalog.
Foundry uses Iceberg clients to establish connections to read or write tables to the storage location directly without using Snowflake compute. However, Foundry still uses the warehouse configured on the source for certain metadata queries, such as determining the type of table being accessed.
Refer to the official Snowflake documentation ↗ for more information on querying Iceberg tables with an external engine through Snowflake Horizon Catalog. Writes to Horizon Catalog are currently a private preview Snowflake feature. Refer to the Networking section of this documentation for details on enabling network access to external volumes.
Foundry treats Iceberg tables like regular Snowflake tables if any of the above requirements are not met. Connections to Snowflake are made using the same mechanism as for other Snowflake data (such as tables, views, or materialized views) and rely on a Snowflake compute warehouse to read and write from the table. :::
The table below highlights the virtual table capabilities that are supported for Snowflake.
| Capability | Status |
|---|---|
| Bulk registration | 🟢 Generally available |
| Automatic registration | 🟢 Generally available |
| Table inputs | 🟢 Generally available: tables, views, materialized views in Code Repositories, Pipeline Builder |
| Table outputs | 🟢 Generally available: tables in Code Repositories, Pipeline Builder |
| Incremental pipelines | 🟢 Generally available: APPEND only, tables only [1] |
| Compute pushdown | 🟢 Generally available: Python transforms, Pipeline Builder |
Consult the virtual tables documentation for details on the supported Foundry workflows where Snowflake tables can be used as inputs or outputs.
[1] Incremental pipelines are supported for tables only. Views are not supported for incremental pipelines because they are not versioned. To enable incremental support for pipelines backed by Snowflake virtual tables, ensure that Change Tracking ↗ and Time Travel ↗ are enabled for the appropriate retention period. This functionality relies on CHANGES ↗ The current and added read modes in Python Transforms are supported. These will expose the relevant rows of the change feed based on the METADATA$ACTION column. The METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID columns will be made available in Python Transforms.
Privileges on source credentials¶
For full feature support, you should provide the following privileges to the credentials configured for the source connection. You should apply these on either the database, schema, or table depending on the desired inheritance model.
| Category | Privilege | Notes |
|---|---|---|
| Prerequisite | USAGE |
Must be granted on the Snowflake databases and schemas that will be used in Foundry. |
| Read | SELECT |
Required to read Snowflake tables when using syncs or virtual table inputs. |
| Edit | DELETE, INSERT, TRUNCATE, UPDATE |
Required to modify Snowflake tables when using virtual table outputs. |
| Create | CREATE SCHEMA, CREATE TABLE |
Required to create Snowflake tables when using virtual table outputs. |
When using Iceberg tables, USAGE privilege is required on the external volume where the table is stored.
Additionally, the credentials provided must have usage privileges on the warehouse provided in the source configuration.
Refer to the official Snowflake documentation ↗ for more information on access control privileges in Snowflake.
Source configuration requirements¶
When using virtual tables, remember the following source configuration requirements:
- You must use a Foundry worker source. Virtual tables do not support use of agent worker connections.
- Ensure that bi-directional connectivity and allowlisting is established as described in the Networking section of this documentation.
- If using virtual tables in Code Repositories, refer to the Virtual Tables documentation for details of additional source configuration required.
- You must specify a warehouse in the connection details.
- The credentials provided must have usage privileges on the warehouse.
See the Connection Details section above for more details.
Compute pushdown¶
Foundry offers the ability to push down compute to Snowflake when using virtual tables. Virtual table inputs leverage the Snowflake Spark connector ↗ which has built-in support for predicate pushdown.
When using Snowflake virtual tables registered to the same source as inputs and outputs to a pipeline, it is possible to fully federate compute to Snowflake. To push down compute to Snowflake, review the Python documentation for details. To push down compute to Snowflake in Pipeline Builder, review the External pipelines documentation.
Use Snowflake sources in code¶
You can use pro-code alternatives to connect to Snowflake sources for more complex scenarios.
The examples below demonstrate how to connect to a Snowflake source using the Snowflake Connector for Python ↗ (snowflake-connector-python) in an external transform. Authentication is handled via OIDC, which provides short-lived OAuth tokens without the need for static Snowflake credentials.
Read from Snowflake with an external transform¶
This example reads data from a Snowflake table using OIDC-based OAuth credentials.
import polars as pl
from transforms.api import transform, Output, LightweightOutput
from transforms.external.systems import ResolvedSource, external_systems, Source
import snowflake.connector
@external_systems(
snowflake_source=Source("<source_rid>")
)
@transform.using(
output=Output("<output_dataset_rid>")
)
def read_from_snowflake(output: LightweightOutput, snowflake_source: ResolvedSource):
conn = snowflake.connector.connect(
authenticator="oauth",
token=snowflake_source.get_session_credentials().get().access_token,
account="<account_identifier>",
warehouse="<warehouse_name>",
database="<database_name>",
)
cursor = conn.cursor()
try:
cursor.execute("SELECT * FROM my_table LIMIT 100")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
finally:
cursor.close()
conn.close()
df = pl.DataFrame(rows, schema=columns)
output.write_table(df)
Write to Snowflake with an external transform¶
This example exports data from a Foundry dataset to a Snowflake table.
import polars as pl
from transforms.api import transform, Input, Output, LightweightOutput
from transforms.external.systems import ResolvedSource, external_systems, Source
import snowflake.connector
@external_systems(
snowflake_source=Source("<source_rid>")
)
@transform.using(
output=Output("<output_dataset_rid>"),
source_data=Input("<input_dataset_rid>"),
)
def write_to_snowflake(output: LightweightOutput, snowflake_source: ResolvedSource, source_data):
conn = snowflake.connector.connect(
authenticator="oauth",
token=snowflake_source.get_session_credentials().get().access_token,
account="<account_identifier>",
warehouse="<warehouse_name>",
database="<database_name>",
)
cursor = conn.cursor()
try:
df = source_data.dataframe()
for row in df.iter_rows(named=True):
cursor.execute(
"INSERT INTO my_table (col1, col2) VALUES (%s, %s)",
(row["col1"], row["col2"]),
)
finally:
cursor.close()
conn.close()
output.write_table(df)
:::callout{theme="neutral"} For more details on using session credentials with OIDC-enabled sources, review the Sources in Python documentation. :::
Data model¶
Note that columns of type array ↗, object ↗, and variant ↗ will be parsed by Foundry as type string. This is due to the source's variable typing.
For example, the Snowflake array [ 1, 2, 3 ] would be interpreted by Foundry as the string "[1,2,3]".
See Snowflake's official documentation ↗for more details.
中文翻译¶
Snowflake¶
将 Foundry 连接到 Snowflake,以在 Snowflake 和 Foundry 之间读取和同步数据。
支持的功能¶
| 功能 | 状态 |
|---|---|
| 探索(Exploration) | 🟢 正式可用 |
| 批量导入(Bulk import) | 🟢 正式可用 |
| 增量同步(Incremental) | 🟢 正式可用 |
| 虚拟表(Virtual tables) | 🟢 正式可用 |
| 计算下推(Compute pushdown) | 🟢 正式可用 |
| 导出任务(Export tasks) | 🟡 即将停用 |
| 表导出(Table Exports) | 🟢 正式可用 |
设置¶
- 打开 数据连接(Data Connection) 应用,在屏幕右上角选择 + 新建源(+ New Source)。
- 从可用的连接器类型中选择 Snowflake。
- 按照后续配置提示,使用以下各节中的信息继续设置连接器。
了解更多关于在 Foundry 中设置连接器的信息。
连接详情¶
:::callout{theme="warning"}
包含下划线(_)的 Snowflake 账户必须将下划线替换为短横线(-)。例如,my_account_prod 需要变为 my-account-prod。否则将导致网络问题。
:::
| 选项 | 是否必填 | 描述 |
|---|---|---|
账户标识符(Account identifier) |
是 | 这是 ".snowflakecomputing.com" 之前的标识符。更多详情请参见 Snowflake 的官方文档 ↗。 |
角色(Roles) |
否 | 当提供的凭据可访问多个角色时,连接使用的默认角色。 |
数据库(Database) |
是 | 指定连接后使用的默认数据库。 |
模式(Schema) |
否 | 可选,指定连接后使用的默认模式。如果未指定,则凭据范围内的所有模式均可用。 |
仓库(Warehouse) |
否* | 连接后使用的虚拟仓库。对于已注册的虚拟表(Virtual tables),此仓库将用于源端计算。 |
凭据(Credentials) |
是 | 更多详情请参见下面的身份验证(Authentication)部分。 |
网络连接(Network Connectivity) |
是 | 更多详情请参见下面的网络(Networking)部分。 |
* 同步 Foundry 数据集(Foundry datasets)时仓库详情为可选,但注册虚拟表(Virtual tables)时为必填。
身份验证¶
您可以通过以下方式对 Snowflake 进行身份验证:
| 方法 | 描述 | 文档 |
|---|---|---|
| 用户名和密码 [旧版] | 使用用户名和密码通过用户账户进行身份验证。基本身份验证为旧版,不建议在生产环境中使用。 | 使用密码 ↗ |
| 密钥对身份验证(Key-pair authentication) | 提供用户名和私钥。请注意,仅支持未加密的私钥。Foundry 将安全地加密并存储私钥。 | 密钥对身份验证和密钥对轮换 ↗ |
| 外部 OAuth (OIDC) [推荐] | 使用工作负载身份联合(workload identity federation)以用户身份进行身份验证。工作负载身份联合允许在 Foundry 中运行的工作负载访问 Snowflake,而无需使用 Snowflake 密钥。按照显示的源系统配置说明设置外部 OAuth。 | 工作负载身份联合 ↗ 请参阅我们的 OIDC 文档,了解 Foundry 如何支持 OpenID Connect (OIDC) 的概述。 |
| 编程访问令牌(Programmatic access token) | 使用编程访问令牌 (PAT) 以用户身份进行身份验证。 | 编程访问令牌 ↗ |
对于所有身份验证选项,请确保提供的用户和角色对目标数据库和模式具有使用权限,并对目标表具有选择权限。
注册虚拟表(Virtual tables)时,用户及其角色还应对仓库具有使用权限。
:::callout{theme="warning"} Snowflake 正在逐步推行更改,要求使用密码的人类用户进行多因素身份验证 (MFA),并禁止所有服务用户使用密码。因此,用户名和密码将不再适合作为身份验证机制。请参阅 Snowflake 官方文档 ↗ 获取更多信息和迁移指导。 :::
网络¶
对于在 Foundry 工作节点上运行的连接,在数据连接应用中设置源时,必须添加相应的出站策略(Egress policies)。
要识别需要加入白名单的 Snowflake 账户主机名和端口号,您可以在 Snowflake 控制台中运行以下命令。确保至少将 SNOWFLAKE_DEPLOYMENT 和 STAGE 的条目作为出站策略添加到 Foundry 中。
SELECT t.VALUE:type::VARCHAR as type,
t.VALUE:host::VARCHAR as host,
t.VALUE:port as port
FROM TABLE(FLATTEN(input => PARSE_JSON(SYSTEM$ALLOWLIST()))) AS t;
有关识别需要加入白名单的主机名和端口号的更多信息,请参见 Snowflake 的官方文档 ↗。
:::callout{theme="neutral"} 从 Foundry 到 Snowflake 的连接通常来自您环境的默认公共网关 IP。但是,同一云提供商内部的流量(例如 AWS-AWS 或 Azure-Azure)可能使用不同的路由,并且需要通过 PrivateLink 建立连接。请参见下方每个云提供商所需的额外设置,或联系您的 Palantir 代表获取更多指导。 :::
托管在 S3 上的 Snowflake 实例¶
如果您的 Snowflake 实例配置为通过 VPCE ↗ 路由内部 S3 阶段(stage)流量,则必须手动配置 Snowflake JDBC 驱动程序不使用自定义 VPCE 域名。否则,驱动程序将被路由到自定义 VPCE 域名(从 Foundry 的 VPC 无法访问),并且将无法连接到格式为 <bucketname>.bucket.vpce-<vpceid>.s3.<region>.vpce.amazonaws.com 的 URL。
您可以通过在实例的连接详情(Connection details)中添加一个 JDBC 连接属性来手动配置,键为 S3_STAGE_VPCE_DNS_NAME,值为空(相当于将其设置为 null)。
然后,S3 阶段流量将通过 AWS S3 网关端点 (<bucketname>.bucket.s3.<region>.vpce.amazonaws.com) 路由,该端点保持私有连接,因此流量不会通过公共互联网路由。
请查看我们的 PrivateLink 出站文档 了解更多信息。
:::callout{theme="neutral"} 对于依赖于与 Foundry 实例位于同一区域的 S3 存储桶的出站策略,请确保您已完成Amazon S3 存储桶策略文档中详述的受影响存储桶的额外配置步骤。 :::
托管在 Azure 上的 Snowflake 实例¶
用于 Foundry Snowflake 连接器的 Snowflake JDBC 驱动程序在获取数据时可能会尝试直接连接到底层"内部阶段(internal stage)"存储桶。对于托管在 Azure 上的 Snowflake,由于Azure 托管的 Foundry 注册环境通过 Azure 服务端点路由流量,因此必须按照以下说明明确允许从 Foundry 到底层阶段存储桶的网络连接。
收集有关 Snowflake 仓库的必要信息¶
您需要以下关于 Azure 托管的 Snowflake 仓库的信息,以建立与 Foundry 的网络连接:
- 系统白名单域名(System allowlist domains)的完整列表。
- 从 Snowflake 获取的 Azure 存储账户标识符(Azure storage account identifier)。
系统白名单域名¶
使用 SYSTEM$ALLOWLIST 命令获取成功连接可能需要的所有域名的完整列表。
- 注意:这与上面用于定义出站策略的命令相同,并在数据连接的网络面板提示中进行了说明。
- 此列表将包括用作 Snowflake 仓库阶段的 Azure 存储桶的域名。
Azure 存储账户标识符¶
对于从 SYSTEM$ALLOWLIST 命令返回的 Azure 存储桶,您还需要检索存储账户标识符。
- 如果您使用的是 Snowflake Standard Edition 或 Enterprise Edition,则需要向 Snowflake 支持提交工单以请求存储账户标识符。
- 如果您使用的是 Snowflake Business Critical Edition ↗,可以通过以下步骤检索存储账户标识符:
- 将账户的
ENABLE_INTERNAL_STAGES_PRIVATELINK↗ 参数设置为TRUE。 - 然后,调用
SYSTEM$GET_PRIVATELINK_CONFIG()↗ 函数,该函数返回一个名为privatelink-internal-stage的字段,其中包含 Azure 存储账户资源标识符。- 请注意,即使您不通过 PrivateLink 连接,仍然需要检索并提供存储账户资源标识符。
完整的 Azure 存储账户资源标识符格式如下:
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Storage/storageAccounts/{storageAccountName}
有关如何直接在 Azure 控制台中查找 Azure 存储账户资源 ID 的更多信息,请参见 Azure 文档 ↗。使用 VNET 规则和存储账户标识符限制跨账户网络流量符合微软发布的最佳实践 ↗,并且应适用于从 Azure 计算实例到 Azure 托管的 Snowflake 仓库的所有连接。
允许从 Foundry 到与 Snowflake 仓库关联的 Azure 存储账户的出站流量¶
现在您已收集了有关 Snowflake 仓库的必要信息,可以创建所需的策略以允许 Foundry 访问您的 Snowflake 数据。
- 为 Azure 存储内部阶段创建一个标准出站策略,并将其附加到您的 Snowflake 源。
-
请注意,您应该为
SYSTEM$ALLOWLIST命令返回的所有内容添加策略,而不仅仅是存储桶域名。 -
创建一个 Azure 存储策略,粘贴存储账户资源标识符。
返回数据连接中的 Snowflake 源,确认您可以探索该源并运行同步。
Iceberg 表(仅限虚拟表)¶
本文档的虚拟表(Virtual tables)部分提供了有关集成在 Snowflake Horizon Catalog 中注册的 Iceberg 表的详细信息。此功能需要与存储 Iceberg 表的外部卷(External volume)建立网络连接。您必须为每个外部卷创建网络出站策略(Network egress policies),因为 Foundry 直接读取和写入存储位置,而不是使用 Snowflake 计算仓库查询表。
有关外部卷以及如何确定表存储位置的更多信息,请参阅 Snowflake 官方文档 ↗。
:::callout{theme="neutral"}
为外部卷配置出站策略可使网络流量从 Foundry 出站到该存储位置。不同云提供商的网络控制可能有所不同,因此您应确保存储位置上的网络控制允许来自 Foundry 的网络流量。了解有关识别 Foundry 流量来源 IP 地址的更多信息。。
此外,请参阅本网络(Networking)文档的其他部分,以确保正确设置与托管在与 Foundry 实例相同云区域的外部卷的连接。
:::
虚拟表¶
本节提供有关将虚拟表(Virtual tables)与 Snowflake 源一起使用的其他详细信息。本节不适用于同步到 Foundry 数据集。
:::callout{theme="neutral"}
Snowflake 连接器现在在使用虚拟表访问在 Horizon Catalog 中注册的 Iceberg 表时提供了增强功能。Foundry 使用 Horizon Catalog 中公开的 Iceberg REST API 来访问表,并在底层存储位置(在 Snowflake 中配置为外部卷)中读取和写入数据。此外,Foundry 使用 Horizon Catalog 凭据分发(Credential vending)来确保对云对象存储的安全访问。这还可以提高对这些表的读写性能。
如果您满足以下条件,连接器会自动公开 Iceberg 功能:
- 配置允许从 Foundry 连接到存储表的外部卷的网络出站策略。
- 在源上配置具有从 Horizon Catalog 获取分发凭据权限的凭据。
Foundry 使用 Iceberg 客户端建立连接,直接读取或写入表到存储位置,无需使用 Snowflake 计算。但是,Foundry 仍然使用源上配置的仓库进行某些元数据查询,例如确定正在访问的表类型。
有关通过 Snowflake Horizon Catalog 使用外部引擎查询 Iceberg 表的更多信息,请参阅 Snowflake 官方文档 ↗。对 Horizon Catalog 的写入目前是 Snowflake 的 private preview 功能。有关启用对外部卷的网络访问的详细信息,请参阅本文档的网络(Networking)部分。
如果上述任何要求未满足,Foundry 会将 Iceberg 表视为常规 Snowflake 表。到 Snowflake 的连接使用与其他 Snowflake 数据(如表、视图或物化视图)相同的机制建立,并依赖 Snowflake 计算仓库来读取和写入表。 :::
下表突出显示了 Snowflake 支持的虚拟表功能。
| 功能 | 状态 |
|---|---|
| 批量注册(Bulk registration) | 🟢 正式可用 |
| 自动注册(Automatic registration) | 🟢 正式可用 |
| 表输入(Table inputs) | 🟢 正式可用:代码仓库(Code Repositories)、管道构建器(Pipeline Builder)中的表、视图、物化视图 |
| 表输出(Table outputs) | 🟢 正式可用:代码仓库(Code Repositories)、管道构建器(Pipeline Builder)中的表 |
| 增量管道(Incremental pipelines) | 🟢 正式可用:仅 APPEND,仅表 [1] |
| 计算下推(Compute pushdown) | 🟢 正式可用:Python 转换(Python transforms)、管道构建器(Pipeline Builder) |
请查阅虚拟表文档,了解 Snowflake 表可用作输入或输出的受支持 Foundry 工作流的详细信息。
[1] 增量管道仅支持表。视图不支持增量管道,因为它们没有版本控制。要为基于 Snowflake 虚拟表的管道启用增量支持,请确保为适当的保留期启用变更跟踪(Change Tracking) ↗和时间旅行(Time Travel) ↗。此功能依赖于 CHANGES ↗。Python 转换(Python Transforms) 中支持 current 和 added 读取模式。这些模式将根据 METADATA$ACTION 列公开变更源的相关行。METADATA$ACTION、METADATA$ISUPDATE、METADATA$ROW_ID 列将在 Python 转换中可用。
源凭据的权限¶
为了获得完整的功能支持,您应为源连接配置的凭据提供以下权限。您应根据所需的继承模型,在数据库、模式或表上应用这些权限。
| 类别 | 权限 | 备注 |
|---|---|---|
| 先决条件 | USAGE |
必须授予将在 Foundry 中使用的 Snowflake 数据库和模式。 |
| 读取 | SELECT |
使用同步或虚拟表输入读取 Snowflake 表时需要。 |
| 编辑 | DELETE, INSERT, TRUNCATE, UPDATE |
使用虚拟表输出修改 Snowflake 表时需要。 |
| 创建 | CREATE SCHEMA, CREATE TABLE |
使用虚拟表输出创建 Snowflake 表时需要。 |
使用 Iceberg 表时,需要对存储表的外部卷具有 USAGE 权限。
此外,提供的凭据必须对源配置中提供的仓库具有使用权限。
有关 Snowflake 中访问控制权限的更多信息,请参阅 Snowflake 官方文档 ↗。
源配置要求¶
使用虚拟表(Virtual tables)时,请记住以下源配置要求:
- 您必须使用 Foundry 工作节点(Foundry worker) 源。虚拟表不支持使用代理工作节点(agent worker)连接。
- 确保按照本文档的网络部分所述建立双向连接和白名单。
- 如果在代码仓库中使用虚拟表,请参阅虚拟表文档了解所需的额外源配置详情。
- 您必须在连接详情中指定一个仓库。
- 提供的凭据必须对仓库具有使用权限。
有关更多详细信息,请参见上面的连接详情(Connection Details)部分。
计算下推¶
Foundry 提供在使用虚拟表时将计算下推到 Snowflake 的能力。虚拟表输入利用 Snowflake Spark 连接器 ↗,该连接器内置了对谓词下推(Predicate pushdown)的支持。
当使用注册到同一源的 Snowflake 虚拟表作为管道的输入和输出时,可以将计算完全联合到 Snowflake。要将计算下推到 Snowflake,请查看 Python 文档了解详情。要在管道构建器中将计算下推到 Snowflake,请查看外部管道文档。
在代码中使用 Snowflake 源¶
对于更复杂的场景,您可以使用纯代码替代方案(Pro-code alternatives)连接到 Snowflake 源。
以下示例演示了如何在外部转换(External transform)中使用 Snowflake Connector for Python ↗ (snowflake-connector-python) 连接到 Snowflake 源。身份验证通过 OIDC 处理,它提供短期 OAuth 令牌,无需静态 Snowflake 凭据。
使用外部转换从 Snowflake 读取¶
此示例使用基于 OIDC 的 OAuth 凭据从 Snowflake 表读取数据。
import polars as pl
from transforms.api import transform, Output, LightweightOutput
from transforms.external.systems import ResolvedSource, external_systems, Source
import snowflake.connector
@external_systems(
snowflake_source=Source("<source_rid>")
)
@transform.using(
output=Output("<output_dataset_rid>")
)
def read_from_snowflake(output: LightweightOutput, snowflake_source: ResolvedSource):
conn = snowflake.connector.connect(
authenticator="oauth",
token=snowflake_source.get_session_credentials().get().access_token,
account="<account_identifier>",
warehouse="<warehouse_name>",
database="<database_name>",
)
cursor = conn.cursor()
try:
cursor.execute("SELECT * FROM my_table LIMIT 100")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
finally:
cursor.close()
conn.close()
df = pl.DataFrame(rows, schema=columns)
output.write_table(df)
使用外部转换写入 Snowflake¶
此示例将数据从 Foundry 数据集导出到 Snowflake 表。
import polars as pl
from transforms.api import transform, Input, Output, LightweightOutput
from transforms.external.systems import ResolvedSource, external_systems, Source
import snowflake.connector
@external_systems(
snowflake_source=Source("<source_rid>")
)
@transform.using(
output=Output("<output_dataset_rid>"),
source_data=Input("<input_dataset_rid>"),
)
def write_to_snowflake(output: LightweightOutput, snowflake_source: ResolvedSource, source_data):
conn = snowflake.connector.connect(
authenticator="oauth",
token=snowflake_source.get_session_credentials().get().access_token,
account="<account_identifier>",
warehouse="<warehouse_name>",
database="<database_name>",
)
cursor = conn.cursor()
try:
df = source_data.dataframe()
for row in df.iter_rows(named=True):
cursor.execute(
"INSERT INTO my_table (col1, col2) VALUES (%s, %s)",
(row["col1"], row["col2"]),
)
finally:
cursor.close()
conn.close()
output.write_table(df)
:::callout{theme="neutral"} 有关在启用 OIDC 的源中使用会话凭据的更多详细信息,请查看 Python 中的源(Sources in Python)文档。 :::
数据模型¶
请注意,类型为 array ↗、object ↗ 和 variant ↗ 的列将被 Foundry 解析为 string 类型。这是由于源的变量类型所致。
例如,Snowflake 数组 [ 1, 2, 3 ] 将被 Foundry 解释为字符串 "[1,2,3]"。
更多详细信息,请参见 Snowflake 的官方文档 ↗。