Comparison of projections and hive-style partitioning¶
The use cases for projections (in particular, filter-optimized projections) partially overlap with those for hive-style partitioning. This page outlines some of the key differences between the two techniques, which include:
- Automatic compaction for projections
- Immediate usability for hive-style partitioning
- Data content constraints for projections
- Suitability for high-cardinality columns
- Supported consumers
Automatic compaction¶
Projections are automatically compacted to prevent excessive incremental file build-up over time. There is no automatic compaction for datasets that are written with hive-style partitioning.
Immediate usability¶
When a dataset is written with hive-style partitioning, readers immediately benefit from optimized queries. Projection datasets, by contrast, are built asynchronously after the transaction is committed on the canonical dataset. After a SNAPSHOT transaction on the canonical dataset, readers therefore get no benefit from the projection until the next projection build completes.
Data content constraints¶
Datasets written with hive-style partitioning are typically written in the open Apache Parquet format, which does not impose any notable constraints on data content. However, filter-optimized projections use a proprietary format that has a few constraints (most notably, string columns with very long values are not supported).
Suitability for high-cardinality columns¶
With hive-style partitioning, at least one file is written per unique value of the partition column, so performing hive-style partitioning on a high-cardinality column will result in an excessive number of output files. Filter-optimized projections are designed to handle high-cardinality columns well (especially timestamp columns in the context of timeseries data).
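To make the file-count effect concrete, the following is a minimal sketch in plain Python that only computes the partition directories a hive-style write would create. It does not write any files or call any Foundry API. The column names `region` and `event_ts`, the sample rows, and the helper `hive_partition_dirs` are all hypothetical and chosen for illustration.

```python
from collections import Counter


def hive_partition_dirs(rows, column):
    """Map each hive-style partition directory (column=value) to its row count."""
    return Counter(f"{column}={row[column]}" for row in rows)


# Hypothetical sample: 1,000 events, each with a distinct timestamp
# and one of three regions.
rows = [
    {
        "event_ts": f"2024-01-01T00:{i // 60:02d}:{i % 60:02d}",
        "region": ("eu", "us", "apac")[i % 3],
    }
    for i in range(1000)
]

low_cardinality = hive_partition_dirs(rows, "region")
high_cardinality = hive_partition_dirs(rows, "event_ts")

# Each directory holds at least one file, so these counts are lower
# bounds on the number of output files.
print(len(low_cardinality), "directories when partitioning by region")
print(len(high_cardinality), "directories when partitioning by event_ts")
```

In this sketch, partitioning by the timestamp column produces one directory per row, which is the pattern that makes high-cardinality columns a poor fit for hive-style partitioning.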
Supported consumers¶
Spark, Polars, and other query engines can benefit from hive-style partitioning. However, only Spark (specifically Foundry Spark, the Spark version used within Foundry) is able to benefit from filter-optimized projections.
Additionally, code in Python or Java that uses the transforms filesystem APIs can leverage the information stored in file paths with hive-style partitioning to implement advanced custom optimizations. Projections are not exposed to user code in this way.
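As a rough illustration of the kind of custom optimization this enables, the sketch below parses `key=value` directory components out of hive-style file paths and uses them to skip files that cannot match a filter. It works on plain path strings only. The paths, the partition columns `region` and `date`, and the helpers `partition_values` and `prune` are hypothetical. In a real transform the paths would come from a filesystem listing, which is not shown here.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Extract key=value pairs from the directory components of a hive-style path."""
    values = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only the paths whose partition values match every requested key."""
    return [
        p
        for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing for a dataset partitioned by region and date.
paths = [
    "region=eu/date=2024-01-01/part-0000.parquet",
    "region=eu/date=2024-01-02/part-0000.parquet",
    "region=us/date=2024-01-01/part-0000.parquet",
]

# Only files under region=eu need to be opened; the rest are skipped
# without reading any file contents.
selected = prune(paths, region="eu")
print(selected)
```

This sketch assumes partition values appear literally in the path. Values containing special characters may be escaped in real paths and would need decoding before comparison.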