Comparison of projections and hive-style partitioning

The use cases for projections (in particular, filter-optimized projections) partially overlap with those for hive-style partitioning. This page outlines some of the key differences between the two techniques, which include:

Automatic compaction for projections
Immediate usability for hive-style partitioning
Data content constraints for projections
Suitability for high-cardinality columns
Supported consumers

Automatic compaction

Projections are automatically compacted to prevent excessive incremental file build-up over time. There is no automatic compaction for datasets that are written with hive-style partitioning.

Immediate usability

When a dataset is written with hive-style partitioning, readers immediately benefit from optimized queries. Projection datasets, however, are built asynchronously after the transaction is committed on the canonical dataset. As a result, after a SNAPSHOT transaction on the canonical dataset, readers get no benefit from the projection until the next projection build completes.

Data content constraints

Datasets written with hive-style partitioning are typically written in the open Apache Parquet format, which does not impose any notable constraints on data content. However, filter-optimized projections use a proprietary format that has a few constraints (most notably, string columns with very long values are not supported).

Suitability for high-cardinality columns

With hive-style partitioning, at least one file is written per unique value, so partitioning on a high-cardinality column results in an excessive number of output files. Filter-optimized projections are designed to handle high-cardinality columns well (especially timestamp columns in the context of timeseries data).
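
The file-count effect can be illustrated with a small sketch that uses only the Python standard library. It does not write real Parquet files: the helper `hive_partition_paths`, the column names, and the sample rows are all hypothetical, and the sketch assumes exactly one output file per unique value, which is the lower bound described above.

```python
def hive_partition_paths(rows, column):
    """Return the minimal set of hive-style file paths for `rows`.

    Hypothetical helper: assumes exactly one output file per distinct
    value of `column` (the lower bound for hive-style partitioning).
    """
    return sorted({f"{column}={row[column]}/part-00000.parquet" for row in rows})


# 1,000 illustrative rows: a low-cardinality `region` column and a
# high-cardinality `event_time` column (one distinct epoch second per row).
START_EPOCH = 1_704_067_200
rows = [
    {"region": ("emea", "apac", "amer")[i % 3], "event_time": START_EPOCH + i}
    for i in range(1000)
]

by_region = hive_partition_paths(rows, "region")
by_time = hive_partition_paths(rows, "event_time")

print(len(by_region))  # 3 files at minimum
print(len(by_time))    # 1000 files at minimum, one per row
```

Under these assumptions, partitioning on the timestamp column produces as many files as there are rows, which is the situation the paragraph above warns about.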

Supported consumers

Spark, Polars, and other query engines can benefit from hive-style partitioning. However, only Spark (specifically Foundry Spark, the Spark version used within Foundry) is able to benefit from filter-optimized projections.

Additionally, code in Python or Java that uses the transforms filesystem APIs can leverage the information stored in file paths with hive-style partitioning to implement advanced custom optimizations. Projections are not exposed to user code in this way.
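
As a rough illustration of the kind of custom optimization described above, the sketch below parses `key=value` segments out of hive-style file paths and skips files whose partition values cannot match a filter. It works on plain path strings only; how the paths are listed (for example, through the transforms filesystem APIs) is deliberately left out. The helper names `parse_partition_values` and `prune_paths`, and the sample paths, are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def parse_partition_values(path):
    """Extract hive-style `key=value` directory segments from a file path."""
    values = {}
    # Only directory segments carry partition values; skip the file name.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def prune_paths(paths, **wanted):
    """Keep only the paths whose partition values match every filter."""
    return [
        p for p in paths
        if all(parse_partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing for a dataset partitioned by `region` and `year`.
paths = [
    "region=emea/year=2023/part-00000.parquet",
    "region=emea/year=2024/part-00000.parquet",
    "region=apac/year=2024/part-00000.parquet",
]

selected = prune_paths(paths, region="emea", year="2024")
print(selected)
```

Only the selected files then need to be opened, which mirrors the partition pruning that query engines perform automatically on hive-style layouts.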
