Guidance on removing Markings

Access requirements for platform resources are controlled by Markings. Markings restrict access in an all-or-nothing fashion: in order to access a resource, a user must be a member of all Markings applied to the resource. In addition, Markings are inherited along both the file hierarchy and direct dependencies. If you have the correct permissions, you can remove Markings directly from resources and along direct dependencies.
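The all-or-nothing rule amounts to a subset check: every Marking on the resource must also be held by the user. The sketch below is an illustrative model only, not a platform API; the function name `can_access` and the Marking names are hypothetical.

```python
def can_access(user_markings: set, resource_markings: set) -> bool:
    """Return True only if the user holds every Marking on the resource."""
    return resource_markings <= user_markings


# Hypothetical Marking names, for illustration only.
resource = {"pii", "finance"}

print(can_access({"pii", "finance", "legal"}, resource))  # True: holds both Markings
print(can_access({"pii"}, resource))                      # False: "finance" is missing
print(can_access({"pii"}, set()))                         # True: the resource is unmarked
```

Holding extra Markings never hurts, and holding only some of the required ones never helps.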

Markings are frequently used because they are legible throughout the platform and propagate along direct dependencies, protecting sensitive data. In some circumstances, a Marking may be applied early in a pipeline and need to be removed later in a pipeline. This page provides more information on how to remove Markings depending on your pipeline structure.

You can also remove Markings or Organizations with Pipeline Builder.

Scenarios

Below are three scenarios related to applying a Marking early in a pipeline and removing it later in a pipeline.

Scenario 1: Replacing severing by Marking removal

This scenario applies when:

  • the pipeline already has a Marking applied,
  • the Marking is removed by severing, and
  • the Project-level "propagate view requirements" setting is not turned on.

In this case, you are migrating from severing (a deprecated feature) to removing an inherited Marking.

scenario1

In the old state shown in the example above, severing has been used to prevent the Marking from propagating. Assuming that severing is only being used to remove a Marking, we strongly recommend that you replace severing with Marking removal, as in the new state in the example above. When removing the Marking, it is useful to think about the approval mode configuration of the repository which contains the Marking removal transform.

If "propagate view requirements" is enabled, see Scenario 2 below.

Scenario 2: Apply a Marking followed by Marking removal to disable "Propagate View Requirements"

This scenario involves applying Markings to a dataset in your pipeline in order to disable the project-level propagate view requirements settings.

propagate_view_requirement_off

New Projects have the Propagate View Requirements option disabled by default, as seen above. For these new Projects, view requirements will not be enforced for downstream derived datasets. Specifically, this means that users accessing a downstream version of the data in a separate Project would not also require access to the upstream data in the Projects where this configuration is disabled.

:::callout{title="Reminder"} Markings always propagate. If data in a new Project has a Marking, that Marking will still propagate to all downstream datasets, regardless of the "propagate view requirements" setting. :::

propagate_view_requirement_on

If you have Projects with the Propagate View Requirements option enabled as in the image above, then view requirements have propagated for datasets in these Projects. This means that users accessing a downstream version of the data in a separate Project would additionally require access to the upstream Project(s) with this config enabled.

We highly recommend disabling view requirement propagation in favor of using Markings.

Before disabling view requirement propagation and introducing Markings to your pipeline, it is worth considering the original purpose of enabling propagating view requirements:

  1. Perhaps there is no compelling reason for having "propagate view requirements" enabled. In that case, you might be able to simply disable this setting, and then ensure that users are only granted access to the datasets to which they explicitly require access. Users will no longer also require access to the upstream datasets which were previously propagating view requirements.
  2. There are a few sensitive datasets in the Project. In this case, it’s best to isolate the sensitive datasets from non-sensitive ones and follow the steps outlined below. Those steps demonstrate how to replace view requirement propagation with Markings, which are the appropriate security primitive meant for these scenarios.
  3. If the steps below do not meet your needs, contact your Palantir representative.

In the example below, our goal is to disable "propagate view requirements" on the Datasource Project. After following the steps above, we learned that the reason "propagate view requirements" is enabled on the project was to protect the raw_dataset_1 dataset because it has sensitive data.

scenario2

In the old state, viewing the contents of Dataset A would require at least “viewer” access on both the Datasource Project and the Downstream Project. Subsequently, severing has been used to remove the view requirement propagation on Dataset B.

In the new state, viewing the contents of Dataset A requires at least “viewer” access on the Downstream Project and access to the Marking. Note that with "propagate view requirements" disabled, accessing Datasets C and D only requires “viewer” access on the Downstream Project.

This change allows disabling of "propagate view requirements" by using a security Marking. In the proposed solution below, we’ll apply a Marking on raw_dataset_1 which is then immediately propagated to all downstream datasets which have any non-severed transaction of raw_dataset_1 as an input. There is an assumption that severing was already in place, and severing is only being replaced by Marking removal. If this is not true in your situation, see Scenario 3, where we discuss implications of applying a Marking to a dataset in detail.

We recommend introducing this change in the following order, so that users do not lose access to Dataset B while the Marking is added and "propagate view requirements" is disabled in the Datasource Project:

  1. Create a new Marking and give relevant users access to the Marking.
  2. Apply the Marking on raw_dataset_1.
  3. Now that the Marking has been applied, you can safely disable "propagate view requirements" in the Datasource Project.
  4. Replace severing with Marking removal in the transform, and ensure that unsevering and Marking removal happens at the same time.
  5. Ensure all datasets in the pipeline are built.
  6. Remove users from the Datasource Project who no longer require access to it.

Scenario 3: Applying a new Marking followed by Marking removal at the dataset level

This potentially complex scenario involves introducing a new Marking early in an existing pipeline without accidentally locking out users later in the pipeline.

scenario3

It is critical to note that the Marking introduced on Dataset A will immediately propagate to all resources that are downstream of that dataset along the transaction lineage. Users will require the marking to access anything derived from the marked dataset.

To understand this better, let’s extend the example above with a pipeline as follows: Dataset A → Dataset B → Downstream Datasets:

  • Dataset A is a raw dataset (about to be marked).
  • Dataset B is derived from Dataset A with sensitive data removed (so the Marking can be removed).
  • Downstream Datasets are datasets derived from Dataset B for which users need access.

Our goal in this example is to ensure that Downstream Datasets never inherit the Marking from Dataset A.

The first thing we need to understand is that marking Dataset A (or a folder enclosing Dataset A) effectively marks all transactions in the entire history of Dataset A. As a consequence, Dataset B and Downstream Datasets inherit the Marking immediately.

If we perform the following steps:

  • Add Marking removal in the Dataset A → Dataset B transform,
  • Update Dataset B’s code, and
  • Rebuild Dataset B ...

... then the latest snapshot transaction on Dataset B will be Marking-free, but all older transactions on Dataset B will still be marked.

:::callout{theme="neutral"} Note that while marking a dataset will mark all of its transactions, removing a Marking in transforms will only remove the Marking for new output transactions. Markings will not be removed from existing transactions. This behavior is not symmetrical. :::
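The asymmetry described in the note above can be made concrete with a small toy model. This is a sketch under simplifying assumptions, not how the platform is implemented: the `Transaction` class, its fields, and the Marking name `sensitive` are all hypothetical. Each transaction inherits Markings from the transactions it was built from, minus whatever the transform stopped propagating at build time.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    inputs: list = field(default_factory=list)   # upstream transactions used for this build
    direct: set = field(default_factory=set)     # Markings applied directly
    stopped: set = field(default_factory=set)    # Markings removed when this transaction was built

    def markings(self) -> set:
        """Direct Markings plus inherited Markings that were not stopped."""
        inherited = set()
        for txn in self.inputs:
            inherited |= txn.markings()
        return self.direct | (inherited - self.stopped)


# Dataset A has two transactions; Dataset B was built once, before any Marking existed.
a_history = [Transaction(), Transaction()]
b_old = Transaction(inputs=[a_history[0]])

# Marking Dataset A marks every transaction in its history.
for txn in a_history:
    txn.direct.add("sensitive")
print(b_old.markings())  # {'sensitive'}: the old output inherits immediately

# Rebuilding Dataset B with Marking removal only affects the new transaction.
b_new = Transaction(inputs=[a_history[1]], stopped={"sensitive"})
print(b_new.markings())  # set()
print(b_old.markings())  # {'sensitive'}: the old transaction is still marked
```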

This means that any data in the Downstream Datasets derived from an older transaction on Dataset B, such as Downstream datasets that are built incrementally, will still inherit the Marking. However, Foundry checks only the most recent transaction for each input when verifying permissions on a dataset view. For incremental datasets, this means a regular build is sufficient to remove the marking.

To ensure each incremental Downstream dataset is unmarked, everything between the unmarking transform on Dataset B and the incremental Downstream dataset must be rebuilt after the unmarking transform is applied on Dataset B. This will ensure that each Downstream dataset depends only on the latest (unmarked) transaction on Dataset B, rather than earlier (marked) transactions.

If the number of Downstream datasets is infeasible for manually triggering a rebuild, we suggest the following steps:

scenario3_2

  1. Create a Dataset A′ and make sure its contents are identical to Dataset A.
  2. Rewrite the code for Dataset B to use Dataset A′ as an input in place of Dataset A. At the same time, make sure to add appropriate Marking removal to the transform and then build Dataset B immediately afterward. There is no need to snapshot build Dataset B.

:::callout{theme="neutral"} Consider the performance effects of doing this swap as it could trigger a SNAPSHOT build of Dataset B. :::

  3. Mark Dataset A′.
  4. Move Dataset A into trash.
  5. No further rebuilds are required at this point.

If you are making these changes in an important pipeline or are unsure about any of these steps, contact your Palantir representative for assistance.

Best practices

  • Do not set a repo to “Don’t require re-approval” mode unless absolutely necessary.
  • If you must enable this mode, do minimize the set of editors of that repository and do ensure that you get a sign-off (if required) for this setting.
  • Do have clear, well-defined criteria for what makes a resource require a Marking.
  • Do write Marking removal transforms to actively select data that does not require a Marking.
  • Do not write Marking removal transforms that filter out data which requires a Marking. This is because new data which merits a Marking can commonly be introduced upstream, and you need to protect against it accidentally flowing through your Marking removal transform.
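  • Examples showing “filter in” approach (recommended):
    # column-based: keep only columns known not to require a Marking
    df.select("salary", "title", "department")
    
    # row-based: keep only rows known not to require a Marking
    states_to_keep = ["OH", "CA", "DE"]
    df.filter(df.state.isin(states_to_keep))
    
  • Examples showing “filter out” approach (not recommended):
    # column-based: any new sensitive column added upstream would pass through
    df.drop("firstname", "lastname")
    
    # row-based: any new sensitive state added upstream would pass through
    states_to_drop = ["FL", "TX", "IL"]
    df.filter(~df.state.isin(states_to_drop))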
    
  • Do perform the Marking removal in the transform where sensitive data removal logic is implemented.
  • Do not abstract or hide this logic in a different repository.
  • Similarly, do not create a separate Markings removal repository (with a suite of identity transforms). The marking removal logic should be explicitly approved during the Marking removal PR review.
  • When performing Marking removal within a Project, do remove the Marking either as far upstream as possible (when a marked dataset is added as an import) or as far downstream as possible (in the last step before creating a Project export).
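To see why "filter in" is the safer default, consider what happens when a new sensitive column appears upstream. The sketch below uses plain Python dictionaries in place of a Spark DataFrame so that it runs anywhere; the helper names and the `ssn` column are hypothetical and only serve the illustration.

```python
ALLOWED_COLUMNS = {"salary", "title", "department"}  # "filter in" list
BLOCKED_COLUMNS = {"firstname", "lastname"}          # "filter out" list


def filter_in(row: dict) -> dict:
    """Keep only columns that were explicitly approved."""
    return {k: v for k, v in row.items() if k in ALLOWED_COLUMNS}


def filter_out(row: dict) -> dict:
    """Drop only columns that were explicitly flagged."""
    return {k: v for k, v in row.items() if k not in BLOCKED_COLUMNS}


# A row after an upstream change introduced a new sensitive column, "ssn".
row = {
    "firstname": "Ada",
    "lastname": "Lovelace",
    "salary": 100,
    "title": "Engineer",
    "department": "R&D",
    "ssn": "000-00-0000",
}

print(sorted(filter_in(row)))   # ['department', 'salary', 'title']
print(sorted(filter_out(row)))  # ['department', 'salary', 'ssn', 'title']
```

With the allowlist, the new column is excluded until someone reviews and approves it. With the blocklist, it flows through the Marking removal transform unnoticed.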

FAQs

Do we need approval for the Marking removal every time the logic in the code changes?

This depends on whether the repository that contains your Marking removal transform is set to the “require re-approval” or the “don't require re-approval” approval mode. Learn more about approval modes.

The Marking removal workflow and the “one repo per Project” recommendations don’t go very well together. How many repositories should we set up per Project for the Marking removal workflow?

Ideally, every time the logic of a Marking removal transform changes, it should undergo a security approval. To balance excessive friction in the approval process against good security posture, we recommend that, if you can, you move all transforms with Marking removal logic (such as obfuscating data, removing columns, and so on) to a separate repo and set that repo to “Require re-approvals”.

Can I add the Marking removal properties (stop_propagating and stop_requiring) on an output in my transforms?

No, these are input properties and cannot be added on outputs. If your goal is to remove a Marking from a certain output, you need to identify all inputs that carry a Marking and add stop_propagating statements to them respectively. For more details, refer to the input transform property documentation.
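The reason every marked input needs its own statement can be shown with a toy model. This is not the transforms API; the function `output_markings`, the input names, and the Marking names are hypothetical. The model assumes that an output inherits the union of its inputs' Markings, after subtracting whatever was stopped on each individual input.

```python
def output_markings(inputs: dict, stop_propagating: dict) -> set:
    """Union of each input's Markings, minus the Markings stopped on that input."""
    result = set()
    for name, markings in inputs.items():
        result |= markings - stop_propagating.get(name, set())
    return result


# Two inputs, both carrying the hypothetical "pii" Marking.
inputs = {"employees": {"pii"}, "payroll": {"pii", "finance"}}

# Stopping "pii" on one input is not enough: it still arrives through the other.
print(sorted(output_markings(inputs, {"employees": {"pii"}})))
# ['finance', 'pii']

# Stopping each Marking on every input that carries it yields an unmarked output.
print(sorted(output_markings(inputs, {"employees": {"pii"}, "payroll": {"pii", "finance"}})))
# []
```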

Which languages support Marking removal?

The following languages support Marking removal:

Why is Marking removal preferred over severing?

Declassification should be carried out carefully and not scattered around Projects and repositories. Marking removal features in the platform provide granular control over permission propagation changes and ensure that such changes are appropriately reviewed.

We are transitioning repos to use Marking removal instead of severing. How can we disable severing for new transforms?

Contact your Palantir representative to disallow adding severing on datasets that have not had severing enabled before.

