Retention policy execution
To execute the configured policies, all datasets are continuously evaluated in a loop. For each dataset, each policy is tested on all its transactions to determine which transactions need to be deleted. Depending on the number of datasets in an environment, this loop can take up to a few days to complete.
Transaction-level marking
Retention policies mark transactions for deletion at the transaction level, and evaluate each branch of a dataset independently. By default, the latest view on any branch will not be marked for deletion. A view starts at the first transaction on a branch; a later SNAPSHOT transaction starts a new view, and every transaction after that SNAPSHOT belongs to the new latest view. If a policy runs on a branch and a transaction falls within that branch's latest view, it cannot be marked for deletion on that branch.
Demonstration
As a single transaction can exist on multiple branches, the same transaction can simultaneously be in the latest view on one branch and outside it on another. Consider the following history, where branch xyz forks from abc at T2:
- Branch abc: T1 (APPEND) → T2 (APPEND) → T3 (SNAPSHOT) → T4 (APPEND)
- Branch xyz: T1 (APPEND) → T2 (APPEND) → T5 (APPEND)
On branch abc, the latest view starts at T3, so T1 and T2 fall outside abc's latest view and, judged on abc alone, would be eligible for marking. On branch xyz, there is no SNAPSHOT, so the latest view starts at T1 and contains every transaction; T1 and T2 are protected there. A retention policy running on abc would therefore not mark T1 or T2, because the same transactions are still in xyz's latest view.
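The latest-view rule used in this demonstration can be sketched in a few lines of Python. This is an illustrative model only, not the retention service's implementation: the `latest_view` helper and the tuple representation of a transaction are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: (transaction id, transaction type).
Transaction = Tuple[str, str]


def latest_view(branch: List[Transaction]) -> List[str]:
    """Return the ids of the transactions in a branch's latest view.

    The view starts at the most recent SNAPSHOT on the branch, or at the
    first transaction if the branch has no SNAPSHOT.
    """
    start = 0
    for index, (_, tx_type) in enumerate(branch):
        if tx_type == "SNAPSHOT":
            start = index  # a later SNAPSHOT starts a new view
    return [tx_id for tx_id, _ in branch[start:]]


abc = [("T1", "APPEND"), ("T2", "APPEND"), ("T3", "SNAPSHOT"), ("T4", "APPEND")]
xyz = [("T1", "APPEND"), ("T2", "APPEND"), ("T5", "APPEND")]

abc_view = latest_view(abc)  # ['T3', 'T4']
xyz_view = latest_view(xyz)  # ['T1', 'T2', 'T5']

# Transactions on abc that sit outside abc's own latest view.
outside_abc_view = [tx_id for tx_id, _ in abc if tx_id not in abc_view]  # ['T1', 'T2']
```

T1 and T2 appear in `outside_abc_view` and in `xyz_view` at the same time, which is the situation the demonstration describes.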
:::callout{theme="danger" title="Data loss risk with latest view deletion"}
You can override the default behavior by enabling the Allow deletion from latest view option in the Advanced options section of a retention policy. However, enabling this option may result in the deletion of current data that is still in use, and may cause incremental builds to fail. If you are using Python transforms with incremental pipelines, you must set the allow_retention=True parameter on the @incremental() decorator to prevent retention-related deletions from triggering a snapshot run.
:::
Learn more about configuring the latest view transaction deletion flag.
If multiple policies would select the same transactions, the first policy executed marks those transactions for deletion, and any subsequent policies ignore them.
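The precedence rule above amounts to a first-match assignment. The following is a minimal sketch under stated assumptions: the policy names, the `age_days` field, and the `assign_first_match` helper are hypothetical and exist only to illustrate the ordering, not to describe how policies are actually configured or stored.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical model: a transaction is a dict, a policy is a (name, predicate) pair.
Transaction = Dict[str, Any]
Policy = Tuple[str, Callable[[Transaction], bool]]


def assign_first_match(
    transactions: List[Transaction], policies: List[Policy]
) -> Dict[str, str]:
    """Map each selected transaction id to the first policy that selected it."""
    claimed_by: Dict[str, str] = {}
    for name, selects in policies:  # policies are visited in execution order
        for tx in transactions:
            if tx["id"] in claimed_by:
                continue  # already handled by an earlier policy
            if selects(tx):
                claimed_by[tx["id"]] = name
    return claimed_by


transactions = [
    {"id": "T1", "age_days": 120},
    {"id": "T2", "age_days": 45},
    {"id": "T3", "age_days": 5},
]
policies = [
    ("older_than_90_days", lambda tx: tx["age_days"] > 90),
    ("older_than_30_days", lambda tx: tx["age_days"] > 30),
]

result = assign_first_match(transactions, policies)
```

Here T1 matches both policies but is attributed only to `older_than_90_days`, because that policy runs first; T3 matches neither and is left alone.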
Mark and sweep
When a policy determines that a transaction should be deleted, it is first marked for deletion. Marking a transaction conveys that the data in the transaction may be deleted at any point, and so should not be read.
A marked transaction is indicated by the message "This transaction has been scheduled for deletion." visible in the dataset history page when a specific transaction is selected.
After a certain duration (usually 7 days, but this may vary) from the time of marking, a transaction will be swept. At this point, the data in the transaction will be deleted and cannot be recovered.
A swept transaction is indicated by the message "Transaction data has been deleted." visible in the dataset history page when the specific transaction is selected.
As a marked but unswept transaction still contains data, a transaction that was marked incorrectly can be unmarked. To do so, first amend the policy or policies that mistakenly marked the transaction. Then contact your Palantir representative within 7 days for help unmarking the transaction; once the transaction has been swept, the data cannot be recovered.
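The mark-and-sweep timeline can be modelled as a simple function of the marking time. This is a sketch under assumptions, not the platform's logic: the state names, the `transaction_state` helper, and the fixed 7-day delay are illustrative, and the real delay may vary by environment.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the usual delay between marking and sweeping; it may differ in practice.
SWEEP_DELAY = timedelta(days=7)


def transaction_state(
    marked_at: Optional[datetime],
    now: datetime,
    sweep_delay: timedelta = SWEEP_DELAY,
) -> str:
    """Classify a transaction by where it sits in the mark-and-sweep timeline."""
    if marked_at is None:
        return "active"  # never marked; data is safe to read
    if now - marked_at < sweep_delay:
        return "marked"  # data still present; unmarking is still possible
    return "sweep_due"  # delay elapsed; data may already be gone for good


marked_at = datetime(2024, 1, 1)
state_day_3 = transaction_state(marked_at, marked_at + timedelta(days=3))
state_day_7 = transaction_state(marked_at, marked_at + timedelta(days=7))
```

Only the `marked` state leaves room to request an unmark, which is why the policy should be corrected and support contacted well before the delay runs out.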