Databricks
Connect Foundry to Databricks to leverage a range of capabilities on top of data, compute, and models available within Databricks.
Supported capabilities
| Capability | Status |
|---|---|
| Exploration | 🟢 Generally available |
| Bulk import | 🟢 Generally available |
| Incremental | 🟢 Generally available |
| Virtual tables | 🟢 Generally available |
| Compute pushdown | 🟢 Generally available: Python transforms, Pipeline Builder |
| External models | 🟢 Generally available |
:::callout{theme="success"} The Databricks connector now offers enhanced functionality when using virtual tables to expose the features of Delta Lake and Apache Iceberg. Refer to the Virtual Tables section of this documentation for information and details on how to configure the connector to enable this functionality. :::
Setup
- Open the Data Connection application and select + New Source in the upper right corner of the screen.
- Select Databricks from the available connector types.
- Follow the additional configuration prompts to continue the setup of your connector using the information in the sections below.
Learn more about setting up a connector in Foundry.
Connection details
The following configuration options are available for the Databricks connector:
| Option | Required? | Default | Description |
|---|---|---|---|
| Hostname | Yes | | The hostname of the Databricks workspace. |
| HTTP Path | Yes | | The HTTP Path value of the Databricks compute resource. This can be the path of either a warehouse or a compute cluster. |
| Cloud Fetch | Yes | True | Indicates whether Cloud Fetch should be enabled. Refer to the Networking documentation below to ensure suitable network connectivity to the cloud storage locations. |
Refer to the official Databricks documentation ↗ for information on how to obtain these values.
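As a rough illustration of how these three options relate to one another, the sketch below composes them into a JDBC-style connection URL. It is not how Foundry builds its connection internally. The `ConnectionDetails` class, the `to_jdbc_url` helper, and the example HTTP Path are hypothetical, and the `EnableQueryResultDownload` property name is an assumption about the Databricks JDBC driver that should be checked against the driver documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionDetails:
    """Hypothetical container mirroring the connector's three options."""
    hostname: str             # workspace hostname, with or without scheme
    http_path: str            # HTTP Path of the warehouse or compute cluster
    cloud_fetch: bool = True  # mirrors the connector default of True


def to_jdbc_url(details: ConnectionDetails) -> str:
    """Compose a JDBC-style URL from the connection details (illustrative)."""
    host = details.hostname.removeprefix("https://").rstrip("/")
    props = {
        "httpPath": details.http_path,
        # Assumed driver property for Cloud Fetch; verify before relying on it.
        "EnableQueryResultDownload": "1" if details.cloud_fetch else "0",
    }
    suffix = ";".join(f"{key}={value}" for key, value in props.items())
    return f"jdbc:databricks://{host}:443;{suffix}"


if __name__ == "__main__":
    example = ConnectionDetails(
        hostname="adb-5555555555555555.19.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/abc123",  # hypothetical path
    )
    print(to_jdbc_url(example))
```

Port 443 is hard-coded because it is the port the connector requires for access to the workspace hostname.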
Authentication
You can authenticate with Databricks in the following ways:
| Method | Description | Documentation |
|---|---|---|
| Basic authentication [Legacy] | Authenticate with a user account using username and password. Basic authentication is legacy and not recommended in production. | Basic authentication ↗ |
| OAuth machine-to-machine | Authenticate as a service principal using OAuth. Create a service principal in Databricks and generate an OAuth secret to obtain a client ID and secret. | OAuth for service principals (OAuth M2M) ↗ |
| Personal access token | Authenticate as a user or service principal using a personal access token. | Personal access tokens (PAT) ↗ |
| Workload identity federation [Recommended] | Authenticate as a service principal using workload identity federation. Workload identity federation allows workloads running in Foundry to access Databricks APIs without the need for Databricks secrets. Create a service principal federation policy in Databricks and follow the displayed instructions to allow the source to securely authenticate as a service principal. | Databricks OAuth token federation ↗ Refer to our OIDC documentation for an overview of how OpenID Connect (OIDC) is supported in Foundry. |
For full feature support, ensure that the credentials provided have been granted the relevant privileges on the relevant catalog and compute resources.
:::callout{theme="neutral"} When using OAuth machine-to-machine authentication in Azure Databricks, be sure to use your Databricks service principal client ID and secret. Authentication using a Microsoft Entra service principal is not supported. Learn more about service principals in Azure Databricks. ↗ :::
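To make the OAuth machine-to-machine flow more concrete, the following sketch builds (but does not send) a client-credentials token request of the kind a service principal would use. The `/oidc/v1/token` path and the `all-apis` scope reflect our understanding of the Databricks workspace token endpoint and should be confirmed in the Databricks OAuth documentation. The `build_token_request` helper and the example values are hypothetical, and Foundry performs this exchange for you once the client ID and secret are configured on the source.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(
    hostname: str, client_id: str, client_secret: str
) -> urllib.request.Request:
    """Build, without sending, an OAuth client-credentials token request."""
    # The client ID and secret travel in an HTTP Basic Authorization header.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    return urllib.request.Request(
        url=f"https://{hostname}/oidc/v1/token",  # assumed endpoint path
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_token_request(
        "adb-5555555555555555.19.azuredatabricks.net",
        "example-client-id",      # hypothetical
        "example-client-secret",  # hypothetical
    )
    print(request.get_method(), request.full_url)
```

Passing the returned request to `urllib.request.urlopen` would perform the exchange. That step is left out so the sketch can be read and run without credentials.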
Networking
For Databricks connections, add the appropriate egress policies when setting up the source in the Data Connection application.
:::callout{theme="neutral"}
Databricks connections typically open a large number of connections at the same time. When using agent proxy egress policies, you may exhaust the connection pool on your agent. If you experience connection pool errors, increase the maxConnections and coreConnections settings in your agent proxy configuration.
:::
The Databricks connector requires network access to the Hostname provided in the configuration options on port 443. This grants access for Foundry to connect to the Databricks workspace and Unity Catalog REST APIs.
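When debugging connectivity, it can help to confirm that the workspace hostname accepts TCP connections on port 443 from a machine you control. The sketch below is one minimal way to do that with the Python standard library. The `can_reach` helper is hypothetical, and the check runs from wherever the script is executed, so a successful result does not prove that Foundry egress policies are configured correctly.

```python
import socket


def can_reach(hostname: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to hostname:port can be opened."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    host = "adb-5555555555555555.19.azuredatabricks.net"  # example hostname
    print(host, "reachable on 443:", can_reach(host))
```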
Cloud Fetch
Cloud Fetch is a feature of the Databricks JDBC driver. Cloud Fetch enables parallel data extraction from Databricks to Foundry through cloud storage, delivering up to 10x faster performance compared to traditional single-threaded transfers.
When enabled, additional network policies may be needed to allow outbound connections to the cloud storage service (AWS S3, Azure Data Lake Storage, or Google Cloud Storage) where Databricks temporarily stores query results. If you are using a Foundry worker connection, egress policies will need to be created for the workspace storage bucket. This is the cloud storage location used by Cloud Fetch.
Review Databricks' official documentation ↗ for details.
External access to storage locations (virtual tables only)
The Virtual Tables section of this documentation provides details on external access in Unity Catalog and the functionality it enables. External access requires network connectivity to a table's storage location (managed or external). Egress policies will need to be created for each storage location to benefit from the features enabled by external access.
Refer to the official Databricks documentation ↗ for more information on external access and how to determine the storage locations of tables.
:::callout{theme="neutral"} When you configure egress policies for a storage location, this allows network traffic to egress from Foundry to that storage location. Additionally, you should ensure that any network controls on the storage location permit network traffic from Foundry. These network controls will vary depending on the cloud provider. Learn more about identifying the IPs where Foundry traffic originates. :::
Examples
Below we provide example egress policies that may need to be configured to ensure network connectivity to Databricks.
| Type | URL | DNS | Port |
|---|---|---|---|
| Databricks workspace | https://adb-5555555555555555.19.azuredatabricks.net/ | adb-5555555555555555.19.azuredatabricks.net | 443 |
| Azure storage location [1] | abfss://<container-name>@<account-name>.dfs.core.windows.net/<table-directory> | <account-name>.dfs.core.windows.net, <account-name>.blob.core.windows.net | 443 |
| Google Cloud Storage (GCS) storage location | gs://<bucket-path>/<table-directory> | storage.googleapis.com | 443 |
| S3 storage location | s3://<bucket-path>/<table-directory> | <bucket-path>.s3.<region>.amazonaws.com | 443 |
[1] Be sure to include both blob.core.<endpoint> and dfs.core.<endpoint> domains when configuring access to Azure storage locations. endpoint may vary depending on the Azure Cloud environment.
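The mapping from a storage URL to the DNS names that need egress policies can be sketched as a small helper. This is only an illustration of the patterns in the table above. The `egress_hosts` function is hypothetical, the example URLs are made up, and it assumes the default public Azure endpoint, so other Azure cloud environments would need a different suffix.

```python
from urllib.parse import urlparse


def egress_hosts(storage_url: str, region: str = "") -> list[str]:
    """Derive the DNS names to allow for a workspace or storage URL."""
    parsed = urlparse(storage_url)
    if parsed.scheme == "https":
        return [parsed.hostname]
    if parsed.scheme == "abfss":
        # hostname drops the "<container-name>@" part of the authority.
        dfs_host = parsed.hostname
        # Azure locations need both the dfs and the blob domain.
        return [dfs_host, dfs_host.replace(".dfs.", ".blob.", 1)]
    if parsed.scheme == "gs":
        return ["storage.googleapis.com"]
    if parsed.scheme == "s3":
        if not region:
            raise ValueError("an S3 location needs a region to form its hostname")
        return [f"{parsed.netloc}.s3.{region}.amazonaws.com"]
    raise ValueError(f"unsupported scheme: {parsed.scheme!r}")


if __name__ == "__main__":
    print(egress_hosts("abfss://data@acct.dfs.core.windows.net/tables/t1"))
    print(egress_hosts("s3://my-bucket/tables/t1", region="us-east-1"))
```

Every derived hostname is used with port 443, matching the table.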
:::callout{theme="neutral"} In a limited number of cases (depending on your Foundry and Databricks environments), it may be necessary to establish a connection via PrivateLink. This is typically the case where both Foundry and Databricks are hosted by the same cloud service provider (CSP), for example AWS-AWS or Azure-Azure. If you believe this applies to your setup, contact your Palantir representative for additional guidance. :::
:::callout{theme="neutral"} For egress policies that depend on an S3 bucket in the same region as your Foundry instance, ensure you have completed the additional configuration steps detailed in our Amazon S3 bucket policy documentation for the affected bucket(s). :::
More options: SSL and hostname validation
You may additionally need to pass in a JDBC property to allow self-signed certificates.
How to identify if this property is needed:
- SSL connections validate server certificates. Normally, SSL validations happen through a certificate chain. By default, both agent and Foundry workers trust most industry-standard certificate chains.
- If the server to which you are connecting has a self-signed certificate, or if a firewall performs TLS interception on the connection, the connector must trust the certificate. Learn more about using certificates in agent-based connections.
- If you are creating a Foundry worker connection and are using a self-signed certificate, you will need to add the JDBC property AllowSelfSignedCerts=1.
How to add the property allowing self-signed certificates:
- At the bottom of the Connection details page, under Connection settings, select More options, then JDBC properties.
- Under JDBC properties configuration, select Add property, then New property, then enter AllowSelfSignedCerts as the key and 1 as the value.
:::callout{theme="neutral"}
When the AllowSelfSignedCerts property is set to 1, SSL verification is disabled. In this case, the connector does not verify the server certificate against the trust store, and does not verify if the server's host name matches the common name or subject alternative names in the server certificate.
This JDBC property and others are outlined in the Databricks driver documentation ↗. The JDBC properties outlined in this documentation are specific to the Databricks driver and will differ from other source types. :::
:::callout{theme="warning"}
The server must provide the full certificate chain in order for SSL verification to work. The certificate chain for the Databricks server can be obtained by running the command openssl s_client -connect {hostname}:{port} -showcerts. To verify the certificate chain, use the OpenSSL command line utility or any other available tool.
:::
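To illustrate what disabling SSL verification gives up, the sketch below contrasts a default TLS context with a permissive one using Python's standard `ssl` module. This is an analogy only: the Databricks JDBC driver is not configured this way, and the `strict_context` and `permissive_context` helpers are hypothetical names.

```python
import ssl


def strict_context() -> ssl.SSLContext:
    """Default behavior: verify the chain and the server hostname."""
    return ssl.create_default_context()


def permissive_context() -> ssl.SSLContext:
    """Rough analogue of AllowSelfSignedCerts=1: no verification at all."""
    ctx = ssl.create_default_context()
    # Hostname checking must be switched off before verify_mode is relaxed.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


if __name__ == "__main__":
    for name, ctx in (("strict", strict_context()), ("permissive", permissive_context())):
        print(name, ctx.verify_mode, "check_hostname =", ctx.check_hostname)
```

The permissive context accepts any certificate, which is why the property should only be set when the certificate cannot be added to the trust store.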
Virtual tables
Virtual tables allow you to connect to data registered in Databricks Unity Catalog. This allows you to both read and write to tables in Databricks from Foundry as well as push down compute to Databricks from pipelines in Foundry. This section provides additional details around using virtual tables with Databricks. This section is not applicable when syncing to Foundry datasets.
:::callout{theme="neutral"} The Databricks connector now offers enhanced functionality when using virtual tables to expose the features of Delta Lake and Apache Iceberg. This functionality requires external access to be enabled in Unity Catalog. When enabled, external access allows Foundry to access tables using the Unity REST API and Iceberg REST catalog, and read and write data in the underlying storage locations. Unity Catalog credential vending is used to ensure secure access to cloud object storage. In addition to enhanced functionality, this can also improve the performance of reads and writes against these tables.
The Databricks connector automatically exposes Delta Lake and Apache Iceberg functionality if you:
- Enable external access in Unity Catalog.
- Configure network egress policies that allow connectivity from Foundry to the table's storage location.
- Configure credentials on the source that can obtain vended credentials from Unity Catalog.
Connections to read or write tables will be made to the storage location directly using Delta Lake or Apache Iceberg clients. Databricks compute will not be used to read or write to the tables. The Unity Catalog REST APIs will be used for certain metadata operations such as determining the type of table being accessed.
Refer to the official Databricks documentation ↗ for more information on external access and how to determine the storage locations of tables. Refer to the Networking section of this documentation for details on enabling network access to storage locations.
Unity Catalog credential vending is required to use external access. Credential vending is not supported for all table types and table features. For example, views or tables with row filters do not support credential vending. Refer to the official Databricks documentation ↗ for more information on credential vending and the requirements.
If any of the above requirements are not met, connections to Databricks will be made using JDBC. JDBC is the same mechanism used for syncs. Refer to the official Databricks documentation ↗ for more information on JDBC connectivity to Databricks. :::
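The fallback rule described in the callout above can be summarized as a small decision function. This is a sketch of the documented behavior, not the connector's implementation. The `TableAccess` class, its field names, and the `access_path` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableAccess:
    """Hypothetical summary of the three documented requirements."""
    external_access_enabled: bool        # enabled in Unity Catalog
    egress_to_storage_configured: bool   # Foundry can reach the storage location
    credentials_can_be_vended: bool      # source credentials and table type allow vending


def access_path(table: TableAccess) -> str:
    """Return "direct" for Delta Lake / Iceberg clients, otherwise "jdbc"."""
    meets_all = (
        table.external_access_enabled
        and table.egress_to_storage_configured
        and table.credentials_can_be_vended
    )
    return "direct" if meets_all else "jdbc"


if __name__ == "__main__":
    # A view cannot have credentials vended, so it falls back to JDBC.
    view = TableAccess(True, True, credentials_can_be_vended=False)
    print(access_path(view))
```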
The table below highlights the virtual table capabilities that are supported for Databricks.
| Capability | Status |
|---|---|
| Bulk registration | 🟢 Generally available |
| Automatic registration | 🟢 Generally available |
| Table inputs | 🟢 Generally available: tables, views, materialized views in Code Repositories, Pipeline Builder |
| Table outputs | 🟢 Generally available: Code Repositories, Pipeline Builder |
| Incremental pipelines | 🟢 Generally available [2] |
| Compute pushdown | 🟢 Generally available: Python transforms, Pipeline Builder |
Consult the virtual tables documentation for details on the supported Foundry workflows where Databricks tables can be used as inputs or outputs. Functionality may vary depending on whether external access is enabled.
[2] To enable incremental support for Spark pipelines backed by Databricks virtual tables, external access must be enabled; incremental computation requires the ability to directly interact with Delta or Iceberg tables. Incremental compute on top of Delta tables relies on Change Data Feed ↗. Incremental compute on top of Iceberg tables relies on Incremental Reads ↗.
Table format and storage locations
The following table provides a summary of the supported formats and workflows when external access is or is not enabled.
| Unity Catalog object | External access required | Format | Table inputs | Table outputs |
|---|---|---|---|---|
| Managed table | Yes | Avro ↗, Delta ↗, Parquet ↗ | ✔️ | |
| Managed table | Yes | Iceberg ↗ | ✔️ | ✔️ |
| External table | Yes | Delta | ✔️ | ✔️ |
| External table | Yes | Avro, Parquet | ✔️ | |
| Managed table | No | Table ↗, View ↗, Materialized view | ✔️ | |
| External table | No | Table, view, materialized view | ✔️ | |
:::callout{theme="neutral"} When using Spark pipelines, virtual table outputs require external access to be enabled. The virtual table output must be either a managed Iceberg or external Delta table in Databricks. :::
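The output rules in the table and callout above reduce to a short lookup. The sketch below encodes them under the assumption that the table is exhaustive. The `OUTPUT_CAPABLE` set and the `supports_table_output` function are hypothetical names.

```python
# (Unity Catalog object, format) pairs that the table marks as valid outputs.
OUTPUT_CAPABLE = {("managed", "iceberg"), ("external", "delta")}


def supports_table_output(
    table_kind: str, table_format: str, external_access: bool
) -> bool:
    """Return True if a table can be used as a virtual table output."""
    # Outputs are only listed for rows where external access is required.
    if not external_access:
        return False
    return (table_kind.lower(), table_format.lower()) in OUTPUT_CAPABLE


if __name__ == "__main__":
    print(supports_table_output("managed", "Iceberg", external_access=True))
    print(supports_table_output("managed", "Delta", external_access=True))
```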
Privileges on source credentials
For full feature support, we recommend granting the following privileges to the credentials used for the source connection. These should be applied on the catalog, schema, or table, depending on the desired inheritance model.
| Category | Privilege | Notes |
|---|---|---|
| Prerequisite | USE CATALOG, USE SCHEMA | Must be granted on the Databricks catalogs and schemas that will be used in Foundry. |
| Metadata | BROWSE | Required to explore source and register tables. |
| Read | SELECT | Required to read Databricks tables when using syncs or virtual table inputs. |
| Edit | MODIFY | Required to modify Databricks tables when using virtual table outputs. |
| Create | CREATE SCHEMA, CREATE TABLE | Required to create Databricks tables when using virtual table outputs. |
| Other | EXTERNAL USE SCHEMA | Enables external access to storage locations. Refer to external access for more details. |
When using external tables, we recommend granting BROWSE, CREATE EXTERNAL TABLE, and EXTERNAL USE LOCATION privileges on the external locations being used. These are required when using virtual table outputs to create external tables.
Additionally, the credentials provided must have usage privileges on the warehouse or compute cluster provided in the source configuration.
Refer to the official Databricks documentation ↗ for more information on managing privileges in Unity Catalog.
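As an illustration of how the privileges above might be granted at the schema level, the following sketch generates SQL GRANT statements for a service principal. The statement shapes follow our understanding of Databricks SQL `GRANT` syntax and should be checked against the Unity Catalog documentation. The `grant_statements` helper and the catalog, schema, and principal names are hypothetical.

```python
def grant_statements(
    principal: str,
    catalog: str,
    schema: str,
    *,
    write: bool = False,
    external_access: bool = False,
) -> list[str]:
    """Generate GRANT statements covering the recommended privileges."""
    who = f"`{principal}`"
    cat = f"`{catalog}`"
    sch = f"`{catalog}`.`{schema}`"
    statements = [
        f"GRANT USE CATALOG ON CATALOG {cat} TO {who};",  # prerequisite
        f"GRANT USE SCHEMA ON SCHEMA {sch} TO {who};",    # prerequisite
        f"GRANT BROWSE ON CATALOG {cat} TO {who};",       # metadata
        f"GRANT SELECT ON SCHEMA {sch} TO {who};",        # read
    ]
    if write:
        # Needed only when the source backs virtual table outputs.
        statements.append(f"GRANT MODIFY ON SCHEMA {sch} TO {who};")
        statements.append(f"GRANT CREATE TABLE ON SCHEMA {sch} TO {who};")
    if external_access:
        statements.append(f"GRANT EXTERNAL USE SCHEMA ON SCHEMA {sch} TO {who};")
    return statements


if __name__ == "__main__":
    for line in grant_statements(
        "foundry-sp", "main", "sales", write=True, external_access=True
    ):
        print(line)
```

Granting at the schema level lets tables inherit the privileges. Granting on individual tables instead is the narrower alternative mentioned above.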
Source configuration requirements
When using virtual tables, remember the following source configuration requirements:
- You must use a Foundry worker source. Virtual tables do not support use of agent worker connections.
- Ensure that bi-directional connectivity and allowlisting are established as described in the Networking section of this documentation, including the recommended networking to storage locations.
- If using virtual tables in Code Repositories, refer to the Virtual Tables documentation for details of additional source configuration required.
- You must specify a warehouse or compute cluster in the connection details using the HTTP Path field. Refer to the official Databricks documentation ↗ on getting connection details for a Databricks compute resource.
See the Connection Details section above for more details.
Compute pushdown
Foundry offers the ability to push down compute to Databricks when using virtual tables. When a pipeline's inputs and outputs are all Databricks virtual tables registered to the same source, it is possible to fully federate compute to Databricks. See the Python documentation for details on how to push down compute to Databricks in Python transforms. To push down compute to Databricks in Pipeline Builder, review the External pipelines documentation.
External models
Databricks models registered in Unity Catalog can be integrated into Foundry via:
Refer to the official Databricks documentation ↗ for more information on making models available in Unity Catalog, and to the guide on setting up Databricks external models in Foundry.