
Branch isolation in Foundry's Iceberg catalog

Foundry's Iceberg catalog extends standard Iceberg so that each branch can independently track its own current schema, default partition spec, and default sort order. Changes to these are isolated to the branch they run against and do not affect the table's main (master) branch or any other branch. As a result, Iceberg tables match the behavior Foundry users already expect from catalog datasets, where work on a branch is fully isolated, while remaining compliant with the Iceberg REST catalog specification.

This page explains how branch isolation works, how it extends default Iceberg behavior, and how it is automatically applied when using Iceberg inside Foundry. It also covers how to opt into this behavior when using external Iceberg clients.

:::callout{theme="neutral"} The branch isolation described on this page applies when running jobs through Foundry's build system. When writing to Foundry's Iceberg catalog from an external client, standard Iceberg branching behavior applies by default. Alternatively, you can opt into Foundry's branch isolation. See Using branches with external clients for details. :::

Branch isolation in Foundry

When you run a build in Foundry, Foundry automatically isolates schema, partition spec, and sort order changes on Iceberg tables to the build's branch, so that these changes do not unintentionally modify other branches. No configuration is required; this is the default behavior when writing to Iceberg tables through Foundry's build system.

:::callout{theme="neutral"} Note that the branch-scoped behavior described here for schemas also applies to partition spec and sort order, which (like schemas) can be evolved independently on a branch in Foundry. The branch-scoped metadata is stored as additional properties in your table's Iceberg metadata file. For example: "foundry.branch.feature-branch.schema-id" : "3". :::
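As a rough illustration of the property shown above, the following sketch looks up the schema ID a branch is tracking from a table's properties map. Only the `foundry.branch.<branch>.schema-id` key pattern comes from the example on this page; the helper name `branch_schema_id` is hypothetical, and how Foundry treats a branch with no such property is not described here, so the sketch simply returns `None` in that case.

```python
from typing import Mapping, Optional

# Key pattern taken from the example property on this page.
PREFIX = "foundry.branch."
SCHEMA_SUFFIX = ".schema-id"


def branch_schema_id(properties: Mapping[str, str], branch: str) -> Optional[int]:
    """Return the schema ID tracked for `branch`, or None if no property is set.

    Hypothetical helper for illustration; not part of any Foundry or Iceberg API.
    """
    value = properties.get(f"{PREFIX}{branch}{SCHEMA_SUFFIX}")
    return int(value) if value is not None else None


# Example properties map, shaped like the "properties" object in a metadata file.
properties = {
    "foundry.branch.feature-branch.schema-id": "3",
    "write.format.default": "parquet",
}

print(branch_schema_id(properties, "feature-branch"))  # 3
print(branch_schema_id(properties, "other-branch"))    # None
```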

Standard Iceberg behavior

In standard Iceberg, the schema is a table-level concept: the table tracks a single current schema that is shared by all branches. See the Apache Iceberg documentation for details. As a result, modifying the table's schema on any branch changes the schema for main and every other branch.

Having a single schema across all branches makes it difficult to develop or validate schema changes in isolation on a branch before merging into production. To test schema changes on a branch, you would need to modify your table's global schema. Any such change, regardless of which branch it originates from, would immediately affect all consumers of the table on main.

Foundry's behavior

Foundry's Iceberg catalog solves this by automatically tracking a branch-scoped schema, partition spec, and sort order. A job running on a branch (for example, my-branch) sees the schema that the branch is tracking, not the schema on main. Schema changes made within that job are stored for my-branch only and leave the current schema on main entirely unchanged.

The following table summarizes what is branch-scoped versus table-scoped and shared across all branches:

| Metadata | Foundry-extended branching | Default Iceberg branching |
| --- | --- | --- |
| Schema history | Table-scoped | Table-scoped |
| Table properties | Table-scoped | Table-scoped |
| Current snapshot | Branch-scoped | Branch-scoped |
| Current schema | Branch-scoped | Table-scoped |
| Default partition spec | Branch-scoped | Table-scoped |
| Default sort order | Branch-scoped | Table-scoped |

Note that in Foundry-extended branching, the current schema is branch-scoped but the schema history is table-scoped. This means that when a branch adds a new column, the resulting schema is added to the table's shared schema list, but is not set as the current schema on main.
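The split between a table-scoped schema history and a branch-scoped current schema can be pictured with a small in-memory model. This is a sketch for intuition only: `ToyTable` and its methods are hypothetical names, schemas are reduced to tuples of column names, and nothing here reflects how the catalog is actually implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Schema = Tuple[str, ...]  # a schema is modeled as a tuple of column names


@dataclass
class ToyTable:
    # Table-scoped: one schema list shared by every branch.
    schemas: List[Schema] = field(default_factory=list)
    # Branch-scoped: each branch points at its own current schema.
    current_schema_id: Dict[str, int] = field(default_factory=dict)

    def set_schema(self, branch: str, schema: Schema) -> int:
        """Record `schema` in the shared list and point only `branch` at it."""
        if schema not in self.schemas:
            self.schemas.append(schema)
        schema_id = self.schemas.index(schema)
        self.current_schema_id[branch] = schema_id
        return schema_id

    def current_schema(self, branch: str) -> Schema:
        return self.schemas[self.current_schema_id[branch]]


table = ToyTable()
table.set_schema("main", ("customer_id", "score"))
table.set_schema("my-branch", ("customer_id", "score", "score_v2"))

print(len(table.schemas))                 # 2
print(table.current_schema("main"))       # ('customer_id', 'score')
print(table.current_schema("my-branch"))  # ('customer_id', 'score', 'score_v2')
```

Both schemas end up in the shared list, but main keeps pointing at the original one.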

How branch context is established

When a job runs on a branch in Foundry's build system, Foundry automatically injects the branch context into all catalog operations for that job. You do not need to configure anything as a user. Your pipelines read inputs and write outputs against the branch without any additional code.

Examples

The examples below illustrate the two most common scenarios: creating a new table on a branch, and evolving the schema of an existing table on a branch.

Example: Creating a new table on a branch

Suppose you are building a new table on a development branch feature/customer-scores. Your transform creates an output Iceberg table and writes initial data to it.

import pyarrow as pa
from transforms.api import transform
from transforms.tables import IcebergOutput, TableOutput


@transform.using(customer_scores=TableOutput("/<path>/customer_scores"))
def compute(customer_scores: IcebergOutput):
    data = pa.table(
        {
            "customer_id": pa.array([1, 2, 3], type=pa.int64()),
            "score": pa.array([0.91, 0.74, 0.88], type=pa.float64()),
        }
    )
    customer_scores.write_table(data)

Here is the state of the customer_scores table after writing to it on the feature branch:

| | Foundry-extended branching | Default Iceberg branching |
| --- | --- | --- |
| Schema on feature/customer-scores | New schema | New schema |
| Schema on main | Empty | New schema |
| Data on feature/customer-scores | Exists | Exists |
| Data on main | None | None |

With Foundry's branch isolation: The branch owns the initial schema and snapshot. The table exists on main, but has an empty schema with no data.

With default Iceberg branches: Because the current schema is table-scoped, the branch's schema also becomes the current schema on main.

Example: Evolving the schema of an existing table on a branch

Suppose you now have the table customer_scores already in production. You are testing some potential changes to the pipeline and want to add a new column and modify an existing column type, as part of a job running on branch feature/scoring-v2, without affecting what downstream production consumers see on main.

import pyarrow as pa
from transforms.api import transform
from transforms.tables import IcebergOutput, TableOutput


@transform.using(
    customer_scores=TableOutput("/<path>/customer_scores")
)
def compute(customer_scores: IcebergOutput):
    data = pa.table(
        {
            "customer_id": pa.array(["c1", "c2", "c3"], type=pa.string()),
            "score": pa.array([0.91, 0.74, 0.88], type=pa.float64()),
            "score_v2": pa.array([0.95, 0.80, 0.92], type=pa.float64()),
        }
    )
    customer_scores.write_table(data)

Here is the state of the table after the job runs on feature/scoring-v2:

| | Foundry-extended branching | Default Iceberg branching |
| --- | --- | --- |
| Schema on feature/scoring-v2 | New schema | New schema |
| Schema on main | Old schema | New schema |
| Data on feature/scoring-v2 | Exists | Exists |
| Data on main | Unchanged | Unchanged |

With Foundry's branch isolation: The new column and type change are isolated to the branch. main retains its original schema, so downstream consumers are unaffected.

With default Iceberg branches: The branch and main share a single current schema, so the write either fails schema validation or forces you to update main's schema, exposing the change to downstream consumers.

Using branches with external clients

Foundry's Iceberg catalog is fully compliant with the Iceberg REST catalog specification.

When accessing Foundry's Iceberg catalog from an external tool, the catalog's behavior depends on whether the X-PLTR-Branch header is set. Set the X-PLTR-Branch header when working with branches from external clients to ensure consistent Foundry branching behavior.

When the X-PLTR-Branch header is set:

  • All read and write operations are automatically scoped to the branch.
  • The branch set in the X-PLTR-Branch header is used by default when no branch qualifier is specified (or if main is referenced). If a specific (non-main) branch is provided in the request, then that branch will be used rather than the header branch. For example, if the header is set to featurebranch1, then SELECT * FROM table and SELECT * FROM table.branch_main will reference featurebranch1, but SELECT * FROM table.branch_featurebranch2 will reference featurebranch2.
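The resolution rules in the list above can be sketched as a small function. This is a model of the documented behavior, not catalog code: `resolve_branch` is a hypothetical name, and the fallback to main when no header is set follows standard Iceberg resolution.

```python
from typing import Optional

MAIN = "main"


def resolve_branch(header_branch: Optional[str], requested_branch: Optional[str]) -> str:
    """Pick the branch a request targets (hypothetical helper for illustration).

    header_branch:    value of the X-PLTR-Branch header, or None if unset.
    requested_branch: branch qualifier in the query, or None if absent.
    """
    if requested_branch is not None and requested_branch != MAIN:
        # An explicit non-main branch qualifier always wins.
        return requested_branch
    if header_branch is not None:
        # No qualifier, or a reference to main: use the header branch.
        return header_branch
    # No header: standard Iceberg resolution, which defaults to main.
    return MAIN


print(resolve_branch("featurebranch1", None))              # featurebranch1
print(resolve_branch("featurebranch1", "main"))            # featurebranch1
print(resolve_branch("featurebranch1", "featurebranch2"))  # featurebranch2
print(resolve_branch(None, None))                          # main
```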

When the X-PLTR-Branch header is not set, default Iceberg behavior applies:

  • Snapshot reads will use the schema associated with the referenced snapshot. For example, SELECT * FROM table VERSION AS OF <snapshot_id>.
  • Branch reads will use main's schema (table-scoped). For example, SELECT * FROM table.branch_my_branch.
  • Writes are validated against main's schema (table-scoped); writes that do not conform fail rather than adapting to the branch.
  • Schema changes apply to the table-scoped schema on main, and are not isolated to the branch.

Setting the X-PLTR-Branch header externally

To set the branch for a PyIceberg client, add the header to your catalog properties:

from pyiceberg.catalog import load_rest

catalog = load_rest(
    "foundry",
    {
        "uri": "https://your.foundry/iceberg",
        "token": "eyJwb...",
        "header.X-PLTR-Branch": "my-branch"
    },
)
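If you switch between Foundry's branch isolation and default Iceberg behavior, it can help to build the catalog properties in one place. The sketch below uses only the property keys shown above; the function name `foundry_catalog_properties` and the `<token>` placeholder are illustrative assumptions.

```python
from typing import Dict, Optional


def foundry_catalog_properties(
    uri: str, token: str, branch: Optional[str] = None
) -> Dict[str, str]:
    """Build REST catalog properties, adding the branch header only when requested.

    Hypothetical helper; the keys mirror the PyIceberg example above.
    """
    properties = {"uri": uri, "token": token}
    if branch is not None:
        # Opt into Foundry's branch isolation for this client.
        properties["header.X-PLTR-Branch"] = branch
    return properties


isolated = foundry_catalog_properties(
    "https://your.foundry/iceberg", "<token>", branch="my-branch"
)
default = foundry_catalog_properties("https://your.foundry/iceberg", "<token>")

print(isolated["header.X-PLTR-Branch"])     # my-branch
print("header.X-PLTR-Branch" in default)    # False
```

Either dictionary can then be passed as the second argument to `load_rest("foundry", ...)`.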

For Spark, pass the header as a catalog configuration property:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        ...
        .config("spark.sql.catalog.foundry.header.X-PLTR-Branch", "my-branch")
        ...
        .getOrCreate()
)

For more information on connecting external clients to Foundry's catalog, see Authenticating Iceberg clients and Example: Local Jupyter.

Limitations and considerations

Note the following considerations when working with branch isolation.

A single branch applies to all tables in a job

When a job runs on a branch, that branch context applies uniformly to all tables the job reads from and writes to. It is not currently possible to configure different branches for different input or output tables within the same job. Foundry validates that all resolved input branches match the job's branch, and raises an error if they do not.
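The validation described above amounts to a uniformity check. The following sketch models it under stated assumptions: the function name, the exception type, the error message, and the example table paths are all hypothetical, since this page does not specify what error Foundry raises.

```python
from typing import Mapping


class BranchMismatchError(ValueError):
    """Hypothetical error for an input that resolves to a different branch."""


def validate_input_branches(job_branch: str, resolved_inputs: Mapping[str, str]) -> None:
    """Raise if any input table resolved to a branch other than the job's branch.

    resolved_inputs maps a table path to the branch it resolved to.
    """
    mismatched = {
        table: branch
        for table, branch in resolved_inputs.items()
        if branch != job_branch
    }
    if mismatched:
        details = ", ".join(f"{t} -> {b}" for t, b in sorted(mismatched.items()))
        raise BranchMismatchError(
            f"job runs on {job_branch!r} but inputs resolved to other branches: {details}"
        )


# All inputs on the job's branch: passes silently.
validate_input_branches("my-branch", {"/<path>/orders": "my-branch"})

# One input resolved to main: the check fails.
try:
    validate_input_branches(
        "my-branch",
        {"/<path>/orders": "my-branch", "/<path>/customers": "main"},
    )
except BranchMismatchError as error:
    print(error)
```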

Schemas and partition specs referenced by a branch are protected

If a branch is tracking a particular schema or partition spec, that schema or spec cannot be dropped from the table even by operations running on main. This prevents branches from being silently broken by cleanup or maintenance operations running concurrently. The protection is lifted automatically when the branch is deleted.
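One way to picture this protection is as a check against the branch-scoped properties. The sketch below assumes the `foundry.branch.<branch>.schema-id` key pattern from the example earlier on this page and models branch deletion as removal of that property; both the helper names and that deletion model are assumptions, and a real catalog would also need to keep the schema that main itself is using.

```python
from typing import Mapping, Set

PREFIX = "foundry.branch."
SUFFIX = ".schema-id"


def schema_ids_in_use(properties: Mapping[str, str]) -> Set[int]:
    """Collect every schema ID that some branch-scoped property points at."""
    return {
        int(value)
        for key, value in properties.items()
        if key.startswith(PREFIX) and key.endswith(SUFFIX)
    }


def can_drop_schema(schema_id: int, properties: Mapping[str, str]) -> bool:
    """A schema is droppable in this model only if no branch tracks it."""
    return schema_id not in schema_ids_in_use(properties)


properties = {"foundry.branch.feature-branch.schema-id": "3"}
print(can_drop_schema(3, properties))  # False: feature-branch tracks schema 3
print(can_drop_schema(2, properties))  # True: no branch tracks schema 2

# Deleting the branch is modeled here as removing its property.
properties_after_delete: dict = {}
print(can_drop_schema(3, properties_after_delete))  # True
```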

Iceberg branch merge procedures do not respect branch isolation

Iceberg's native branch merge procedures, such as cherry-pick and fast-forward, operate on snapshot references only and are not aware of Foundry's branch-scoped schema properties. If you use these procedures to merge a branch into main, the branch's schema will not be carried over, and main will retain its existing schema. To ensure schema changes on a branch are reflected on main, merge your code and rebuild on main through Foundry's build system.
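To see why the schema is not carried over, consider a toy model in which snapshot references and branch-scoped schema pointers are kept separately. The names `ToyRefs` and `fast_forward`, and the snapshot and schema IDs, are illustrative assumptions and not Iceberg or Foundry APIs; the point is only that a merge that moves snapshot references leaves the schema pointer untouched.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyRefs:
    # Iceberg snapshot references: branch name -> snapshot ID.
    snapshot_of: Dict[str, int] = field(default_factory=dict)
    # Foundry branch-scoped pointers: branch name -> current schema ID.
    schema_of: Dict[str, int] = field(default_factory=dict)


def fast_forward(refs: ToyRefs, target: str, source: str) -> None:
    """Model of a native fast-forward: only the snapshot reference moves."""
    refs.snapshot_of[target] = refs.snapshot_of[source]


refs = ToyRefs(
    snapshot_of={"main": 10, "feature/scoring-v2": 11},
    schema_of={"main": 0, "feature/scoring-v2": 1},
)
fast_forward(refs, target="main", source="feature/scoring-v2")

print(refs.snapshot_of["main"])  # 11: main now points at the branch's snapshot
print(refs.schema_of["main"])    # 0: main's current schema is unchanged
```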

Excluded REST endpoints

Some privileged REST endpoints cannot be called from Foundry branches, notably:

  • dropNamespace
  • updateNamespace
  • dropTable
  • renameTable
  • commitTransaction
  • createView
  • dropView
  • renameView

Version compatibility

Branch isolation is available in transforms-tables library version 0.1211.0 and later.

