跳转至

Advanced details(高级细节)

This page covers advanced details about how dataset projections work in practice to enable more optimized queries on datasets. To learn about projections at a higher level, see this page.

Internally, a projection is a copy of a dataset, optimized for some access pattern. Foundry stores projections on a dataset as child datasets. These are called "projection" datasets. The parent datasets are called "canonical" or "projected" datasets.

  • A projection includes some set of columns.
  • The projection can only satisfy reads on those columns (or a subset of them).

A Build keeps the projection up to date with the most recent data in the main dataset. If a projection is not up to date, it will still be used. However, it might not provide much benefit.

If the projected dataset receives a new SNAPSHOT transaction, any downstream projections are entirely out of date and have no benefit until the projects rebuild. If a projected dataset receives an APPEND transaction, downstream projections are only partially out of date relative to the new transaction. Foundry queries are rewritten to benefit from the projection if they can while still producing results that reflect the new data.

  • Projection builds are configured per-branch. Projections will be regularly compacted to combine smaller partitions into larger ones.
  • Adding a projection will never change the result of a read of a dataset.

At a low level, a projection is either:

  • An approximately globally sorted dataset.
  • A hash bucketed and locally sorted dataset (note that the bucketing and sort columns may differ).

Projection datasets

A projection dataset is stored as a Foundry dataset. This dataset is not visible as a resource but can be accessed via the link in the Projections tab.

  • Each branch for which a projection build is active has a corresponding branch in the projection dataset.
  • Deleting a dataset will delete all projections on the dataset.
  • The Noho service is used to manage projections and is referenced in the dataset schema when setting up a projection.

Projection builds

To keep them up to date, projections are built asynchronously through the normal Foundry build system. This lets users read projected datasets consistently and immediately after a build, but the projection datasets must be built periodically to keep them from becoming out of date.

To allow flexibility in allocating compute resources and controlling costs, Foundry will not automatically create these builds. To configure them, use the scheduler widget in the Projections tab.

There is no universal rule for the appropriate build cadence. The primary determining factor is that queries need to be able to execute within their performance targets on the unprojected portion of the dataset. For example, if your pipeline writes 10 GB per hour, and you have determined that a filtered read should scan no more than 100 GB to meet your performance targets, you should make sure that the projections build at least every 10 hours.

Spark profiles

Projections use an auto-scaling mechanism to find the right number of executors to build a projection. You do not need to manually adjust Spark profiles unless projection builds fail or take too long.

Costs

Foundry will attribute any costs associated to the projection (for example its storage and compute) to the project of the main dataset.

Projection compaction

Compaction is the primary maintenance operation performed on projections. It refers to the process of taking large collections of small sorted files, and combining them into larger sorted files Compaction occurs automatically on projections as a part of the projection build process.

Compaction makes read performance independent of the number of input transactions on the main dataset. This allows projections to speed up reads of frequently incrementally written or streaming datasets. Projection builds might occasionally run for longer than average. This is usually due to a compaction.

Projection query planning

If a projection is available to satisfy a query, it will always be chosen ahead of the main dataset, even if the main dataset is written in a way that would otherwise be more optimal to support a given query. This greatly simplifies the semantics around query planning.

Choosing projections

For the following queries, here is the priority assigned to various projections during query planning:

  • A filter on columns: The projection sorted on the most columns will be preferred, and globally sorted projections are preferred to locally sorted ones. For example, if the filter is x = 1 AND y = 2, projections will be selected in the following priority:
  • Globally sorted on columns x and y
  • Locally sorted on columns x and y (and bucketed on any set of columns)
  • Globally sorted on column x
  • Locally sorted on column x (and bucketed on any set of columns)
  • A join on columns: A projection bucketed on the exact set of join columns will be preferred.
  • A join and a filter: For example, if the query filters on column F and joins on column J, projections will be preferred according to the following priority:
  • A projection bucketed on J and locally sorted on column F
  • A projection globally sorted on column F
  • A projection locally sorted on column F (and bucketed on any column other than column J)
  • A projection bucketed on column J (and locally sorted on anything other than column F)

These priorities reflect the view that filters are typically selective enough so that it is better to optimize for the filter versus the join, though this may not always be the case.


中文翻译


高级细节

本文档详细介绍了数据集投影(Projection)在实际应用中的工作原理,以支持对数据集进行更优化的查询。如需从更高层面了解投影,请参阅此页面

在内部实现中,投影是数据集的一个副本,针对特定访问模式进行了优化。Foundry 将投影作为子数据集存储在数据集中,这些子数据集称为"投影"数据集,而父数据集则称为"规范(Canonical)"或"被投影(Projected)"数据集。

  • 投影包含某些列集合。
  • 投影只能满足对这些列(或其子集)的读取操作。

构建(Build) 操作可使投影与主数据集中的最新数据保持同步。如果投影未及时更新,它仍会被使用,但可能无法提供显著效益。

如果被投影的数据集收到新的 SNAPSHOT 事务,所有下游投影将完全过时,在重新构建前无法提供任何效益。如果被投影的数据集收到 APPEND 事务,下游投影仅相对于新事务部分过时。Foundry 会重写查询,在可能的情况下利用投影优势,同时确保查询结果反映最新数据。

  • 投影构建按分支(Branch)进行配置。投影会定期进行压缩(Compaction),将较小的分区合并为较大的分区。
  • 添加投影绝不会改变数据集读取的结果。

在底层实现中,投影分为两种类型:

  • 近似全局排序的数据集。
  • 哈希分桶(Hash Bucketed)且局部排序的数据集(注意:分桶列和排序列可能不同)。

投影数据集

投影数据集以 Foundry 数据集的形式存储。该数据集不会作为资源(Resource)可见,但可通过"投影(Projections)"选项卡中的链接进行访问。

  • 每个启用了投影构建的分支,在投影数据集中都有对应的分支。
  • 删除数据集将同时删除该数据集上的所有投影。
  • Noho 服务用于管理投影,在设置投影时,该服务会在数据集模式(Schema)中被引用。

投影构建

为保持投影的实时性,投影会通过常规的 Foundry 构建(Build)系统异步构建。这使得用户能够在构建后立即一致地读取被投影的数据集,但投影数据集必须定期构建,以防止其过时。

为灵活分配计算资源并控制成本,Foundry 不会自动创建这些构建。如需配置构建,请使用"投影"选项卡中的调度器(Scheduler)组件。

构建频率没有通用规则。主要决定因素在于:查询需要在未投影的数据集部分上,能够在性能目标范围内执行。例如,如果您的管道每小时写入 10 GB 数据,并且您确定过滤读取最多扫描 100 GB 即可满足性能目标,则应确保投影至少每 10 小时构建一次。

Spark 配置文件(Spark Profiles)

投影使用自动扩缩机制来确定构建投影所需的执行器(Executor)数量。除非投影构建失败或耗时过长,否则您无需手动调整 Spark 配置文件。

成本

Foundry 会将与投影相关的任何成本(例如存储和计算成本)归属于主数据集的项目。

投影压缩

压缩是对投影执行的主要维护操作。该过程将大量小型排序文件合并为更大的排序文件。压缩作为投影构建过程的一部分,会在投影上自动执行。

压缩使读取性能不再依赖于主数据集上的输入事务数量。这使得投影能够加速对频繁增量写入或流式数据集的读取。投影构建偶尔可能比平均时间更长,这通常是由于压缩操作所致。

投影查询规划

如果存在可用于满足查询的投影,系统将始终优先选择该投影而非主数据集,即使主数据集的写入方式在支持特定查询时可能更优。这大大简化了查询规划的语义。

选择投影

对于以下查询,查询规划期间分配给各种投影的优先级如下:

  • 基于列的过滤: 优先选择在最多列上排序的投影,且全局排序投影优先于局部排序投影。例如,如果过滤条件为 x = 1 AND y = 2,则按以下优先级选择投影:
  • 在列 xy 上全局排序
  • 在列 xy 上局部排序(并在任意列集合上分桶)
  • 在列 x 上全局排序
  • 在列 x 上局部排序(并在任意列集合上分桶)
  • 基于列的连接(Join): 优先选择在精确连接列集合上分桶的投影。
  • 连接与过滤组合: 例如,如果查询在列 F 上过滤并在列 J 上连接,则按以下优先级选择投影:
  • 在列 J 上分桶并在列 F 上局部排序的投影
  • 在列 F 上全局排序的投影
  • 在列 F 上局部排序(并在除列 J 外的任意列上分桶)的投影
  • 在列 J 上分桶(并在除列 F 外的任意列上局部排序)的投影

这些优先级反映了以下观点:过滤通常具有足够的选择性,因此优先优化过滤而非连接更为合理,尽管实际情况可能并非总是如此。