跳转至

Compute engine selection(计算引擎选择)

Python transforms support multiple query engines to handle different data processing needs. Choosing the right engine ensures optimal performance, cost efficiency, and developer productivity for your use case.

The available query engines are DuckDB ↗, pandas ↗, Polars ↗, and Spark ↗. These all have Python libraries that enable data manipulation with simple and intuitive APIs.

DuckDB, Pandas and Polars are available through our standard, single-node compute offering also known as lightweight. Spark can be used for distributed compute approaches. For more information refer to the PySpark documentation.

Feature support comparison

The following table shows Foundry feature availability across compute paradigms for Python transforms. For a comparison between Pipeline Builder compute engines see Pipeline Type Support Comparison

Feature Single node (Lightweight) Distributed (Spark)
Incremental transforms
External transforms
Palantir-provided LLMs API: snapshot
Palantir-provided LLMs API: incremental
Trained model inputs
Trained model outputs
Media set API: snapshot
Media set API: incremental
Abort transactions
Dataset unmarking 1 ✓ (Except sever_permissions)
Source unmarking
Data expectations Limited
Tables API Compute pushdown only In-Foundry compute only
Read output enforcing schema 2
Allowed run duration parameter
Run as user parameter (deprecated)
Resource metrics

1 Single node transforms only support stop_propagating and stop_requiring. sever_permissions are not supported.
2 PySpark transforms allow you read data written to an incremental output with a specific schema. This is necessary during the first dataset transaction as no schema will be committed. It is not supported in single node transforms.

Available query engines

Pandas

Best for: Quick iteration, exploratory analysis, and small datasets

Pandas ↗ is a common data manipulation library in the Python ecosystem. It excels at rapid prototyping and provides an extensive ecosystem of compatible libraries. Use pandas when getting started with a new transform, or when your team needs to move quickly with familiar tools.

Key characteristics:

  • Immediate feedback during development
  • Extensive documentation and community support
  • Rich functionality for time series and statistical operations
  • Single-threaded execution model

Best for: Production data pipelines and medium-scale data processing

Polars ↗ should be your default choice for production transforms. Built on Apache Arrow with a Rust core, it delivers excellent performance through columnar storage and lazy evaluation. Polars combines the ease of DataFrame operations with the performance needed for production workloads.

Key characteristics:

  • Automatic query optimization through lazy evaluation
  • Multi-threaded execution on single nodes
  • Memory-efficient columnar storage
  • Predictable performance characteristics

DuckDB

Best for: Medium-scale data with tight latency bounds

DuckDB ↗ is a highly performant single-node SQL query engine optimized for analytical workloads. DuckDB is particularly well-suited for medium-to-large scale data processing tasks that require low latency and efficient resource usage. However, DuckDB lacks a Python DataFrame API, instead requiring users to write raw SQL queries for data manipulation.

Key characteristics:

  • Automatic query optimization through lazy evaluation
  • Automatic memory management with spill-to-disk
  • Processes raw SQL strings rather than a DataFrame API

Spark

Best for: Large-scale data processing and organizational data foundations

Spark ↗ is designed for distributed computing at scale. While it has higher overhead for small operations, it is the only option when your data exceeds single-node capacity or when building critical organizational datasets that require maximum scalability.

Key characteristics:

  • Distributed processing across multiple nodes
  • Automatic memory management with spill-to-disk
  • Battle-tested at petabyte scale
  • Catalyst optimizer for complex query planning

Choosing the right engine

:::callout{theme="success"} The size recommendations below are intended as a general rule of thumb and do not apply to all queries. For the right shapes of transforms, Lightweight engines can process even terabyte-scale inputs on a single node. Refer to the Polars lazy API documentation on larger-than-memory data transformations for more information on how to use Polars streaming. Queries that do not require all data to be loaded into memory at once will scale to arbitrary size on a single node. :::

We recommend starting with Polars as your default choice for production transforms. You can switch to pandas when you need quick iteration, or specific ecosystem libraries. DuckDB is an excellent choice for SQL APIs or maximization of single-node performance. Move to Spark only when data scale demands it, typically at more than 50 GB and when you cannot use optimizations such as filter pushdown.

Characteristic Pandas Polars DuckDB PySpark
Optimal (uncompressed) data size < 1GB 1-50GB 1-50GB > 50GB
Optimal number of rows* < 1 million 1-200 million 1-200 million > 200 million
Startup overhead Minimal Minimal Minimal Significant
Memory efficiency Poor Excellent Excellent Good
Syntax Python Dataframes Python Dataframes SQL** Python Dataframes
Processing speed (small data) Good Excellent Excellent Slow
Processing speed (medium data) Poor Excellent Excellent Fast
Processing speed (large data) Not suitable Variable Variable Excellent
Parallel execution No Single-node Single-node Distributed
Memory spilling No Limited Automatic Automatic

* The number of rows tolerable to each query engine will vary greatly depending on the schema. These numbers are given as a rough guide for common cases.

** Open-source adapter libraries exist (Ibis, SQLFrame).

Migrate from Spark to single node compute

A large portion of existing transforms use Spark, since it has been supported in Python transforms for longer than single node engines. Many of these pipelines could run faster or consume fewer resources if migrated to a single node compute engine. However, it can be difficult to tell how much of an impact this migration will have, and translation can be an expensive process. To get an initial indication of single node performance, we recommend migrating pipelines to DuckDB with SQLFrame before committing to full translations.


中文翻译

计算引擎选择

Python 转换(Transform)支持多种查询引擎以满足不同的数据处理需求。选择合适的引擎可以确保您的用例获得最佳性能、成本效益和开发效率。

可用的查询引擎包括 DuckDB ↗pandas ↗Polars ↗Spark ↗。这些引擎都提供了 Python 库,支持通过简单直观的 API 进行数据操作。

DuckDB、Pandas 和 Polars 可通过我们标准的单节点计算产品(也称为 lightweight)使用。Spark 可用于分布式计算方法。更多信息请参考 PySpark 文档。

功能支持对比

下表展示了 Python 转换(Transform)在不同计算范式下的 Foundry 功能可用性。关于 Pipeline Builder 计算引擎的对比,请参见 管道类型支持对比

功能 单节点 (轻量级) 分布式 (Spark)
增量转换(Incremental transforms)
外部转换(External transforms)
Palantir 提供的 LLM API:快照(snapshot)
Palantir 提供的 LLM API:增量(incremental)
训练模型输入(Trained model inputs)
训练模型输出(Trained model outputs)
媒体集 API(Media set API):快照(snapshot)
媒体集 API(Media set API):增量(incremental)
中止事务(Abort transactions)
数据集取消标记(Dataset unmarking) 1 ✓ (除 sever_permissions 外)
源取消标记(Source unmarking)
数据期望(Data expectations) 有限支持
表 API(Tables API) 仅支持计算下推(Compute pushdown) 仅限 Foundry 内计算
读取输出强制模式(Read output enforcing schema) 2
允许运行时长(Allowed run duration) 参数
以用户身份运行(Run as user) 参数 (已弃用)
资源指标(Resource metrics)

1 单节点转换(Transform)仅支持 stop_propagatingstop_requiring。不支持 sever_permissions
2 PySpark 转换(Transform)允许您读取写入到具有特定模式的增量输出的数据。这在首次数据集事务期间是必需的,因为此时尚未提交任何模式。单节点转换(Transform)不支持此功能。

可用查询引擎

Pandas

最佳用途: 快速迭代、探索性分析和小型数据集

Pandas ↗ 是 Python 生态系统中常用的数据操作库。它在快速原型开发方面表现出色,并提供了丰富的兼容库生态系统。当您开始使用新的转换(Transform),或者您的团队需要使用熟悉的工具快速推进时,请使用 pandas。

主要特点:

  • 开发过程中即时反馈
  • 丰富的文档和社区支持
  • 丰富的时间序列和统计操作功能
  • 单线程执行模型

Polars(推荐)

最佳用途: 生产数据管道和中型数据处理

Polars ↗ 应该是您生产环境转换(Transform)的默认选择。它基于 Apache Arrow 构建,核心采用 Rust 语言,通过列式存储和惰性求值提供卓越的性能。Polars 将 DataFrame 操作的简便性与生产工作负载所需的性能相结合。

主要特点:

  • 通过惰性求值实现自动查询优化
  • 单节点上的多线程执行
  • 内存高效的列式存储
  • 可预测的性能特征

DuckDB

最佳用途: 具有严格延迟限制的中型数据

DuckDB ↗ 是一个高性能的单节点 SQL 查询引擎,专为分析工作负载优化。DuckDB 特别适合需要低延迟和高效资源使用的中大型数据处理任务。然而,DuckDB 缺乏 Python DataFrame API,需要用户编写原始 SQL 查询来进行数据操作。

主要特点:

  • 通过惰性求值实现自动查询优化
  • 自动内存管理,支持溢出到磁盘
  • 处理原始 SQL 字符串,而非 DataFrame API

Spark

最佳用途: 大规模数据处理和组织数据基础

Spark ↗ 专为大规模分布式计算而设计。虽然对于小型操作来说开销较高,但当您的数据超出单节点容量,或者构建需要最大可扩展性的关键组织数据集时,它是唯一的选择。

主要特点:

  • 跨多个节点的分布式处理
  • 自动内存管理,支持溢出到磁盘
  • 经过 PB 级规模验证
  • Catalyst 优化器用于复杂查询规划

选择合适的引擎

:::callout{theme="success"} 以下关于数据大小的建议仅为一般性参考,并不适用于所有查询。对于适当形状的转换(Transform),轻量级引擎可以在单节点上处理甚至 TB 级的输入。有关如何使用 Polars 流式处理(streaming)的更多信息,请参阅 Polars 惰性 API 文档中关于超内存数据转换的内容。不需要一次性将所有数据加载到内存中的查询,可以在单节点上扩展到任意大小。 :::

我们建议将 Polars 作为生产环境转换(Transform)的默认选择。当您需要快速迭代或特定的生态系统库时,可以切换到 pandas。DuckDB 是 SQL API 或最大化单节点性能的绝佳选择。仅当数据规模要求时(通常超过 50 GB,且无法使用诸如过滤器下推(filter pushdown)等优化手段时),才迁移到 Spark。

特性 Pandas Polars DuckDB PySpark
最佳(未压缩)数据大小 < 1GB 1-50GB 1-50GB > 50GB
最佳行数* < 100 万 1-2 亿 1-2 亿 > 2 亿
启动开销 极小 极小 极小 显著
内存效率 优秀 优秀 良好
语法 Python Dataframes Python Dataframes SQL** Python Dataframes
处理速度(小数据) 良好 优秀 优秀
处理速度(中等数据) 优秀 优秀
处理速度(大数据) 不适用 可变 可变 优秀
并行执行 单节点 单节点 分布式
内存溢出到磁盘 有限 自动 自动

* 每个查询引擎可容忍的行数会因模式(schema)的不同而有很大差异。这些数字仅作为常见情况的粗略指南。

** 存在开源适配器库(Ibis、SQLFrame)。

从 Spark 迁移到单节点计算

现有的大量转换(Transform)使用 Spark,因为它在 Python 转换(Transform)中的支持时间比单节点引擎更长。如果迁移到单节点计算引擎,其中许多管道可以运行得更快或消耗更少的资源。然而,很难判断这种迁移会产生多大影响,并且转换过程可能成本高昂。为了初步了解单节点性能,我们建议在投入全面转换之前,先使用 SQLFrame 将管道迁移到 DuckDB