Comparison: Code Repositories vs. Code Workspaces vs. Code Workbook
Foundry has three products available for writing code-based data transformations: Code Workbook, Code Workspaces, and Code Repositories. While there is some feature overlap between these products, each is geared toward distinct workflows and user types. The guide below is intended to help you determine which tool is best suited to your needs.
Code Repositories is recommended for creating robust production pipelines and supporting workflows that require an additional layer of governance and scrutiny. With Code Repositories, data engineers can create efficient pipelines in bulk. Example workflows that are a good fit for Code Repositories include:
- A daily pipeline at high data scale which requires incremental compute.
- A high-visibility pipeline with strict governance requirements, such as the ability to revert to previous versions of the code or to gate code changes on passing unit tests.
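To make the incremental compute mentioned in the first workflow above concrete, here is a minimal, platform-independent sketch in plain Python. It is not Foundry's API: `IncrementalSum`, `full_recompute`, and the `(key, amount)` row format are hypothetical names chosen for illustration, and the sketch assumes an append-only input where earlier rows never change.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalSum:
    """Hypothetical job that keeps per-key totals and reads only unseen rows."""

    totals: dict = field(default_factory=dict)
    rows_seen: int = 0  # high-water mark into the append-only input

    def run(self, input_rows):
        # Only the rows appended since the previous run are processed.
        new_rows = input_rows[self.rows_seen:]
        for key, amount in new_rows:
            self.totals[key] = self.totals.get(key, 0) + amount
        self.rows_seen = len(input_rows)
        return len(new_rows)  # number of rows actually read in this run


def full_recompute(input_rows):
    """Non-incremental baseline: reads every row on every run."""
    totals = {}
    for key, amount in input_rows:
        totals[key] = totals.get(key, 0) + amount
    return totals


rows = [("a", 1), ("b", 2)]
job = IncrementalSum()
processed_day1 = job.run(rows)  # first run reads both rows

rows += [("a", 10)]             # one new row arrives
processed_day2 = job.run(rows)  # second run reads only the appended row
```

Both approaches end with the same totals. The incremental job simply does work proportional to the new data rather than to the full history, which is why this style suits a daily pipeline at high data scale.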
Code Workspaces is recommended for quick and efficient exploratory analyses. It combines familiar IDEs, JupyterLab® and RStudio® Workbench, with the benefits of the Foundry platform, such as data security, branching, build scheduling, and resource management. Example workflows that are a good fit for Code Workspaces include:
- Running a cell-by-cell data analysis and exporting its contents to a shareable report.
- Prototyping a data transformation pipeline or a machine learning model.
Code Workbook is recommended for performing code-based analyses on data at a scale that would not be suitable for Code Workspaces. These analyses can be for one-time use or can produce an artifact that is updated on a recurring basis. Code Workbook can also be used to prototype pipelines, which can then be promoted to Code Repositories. Example workflows that are a good fit for Code Workbook include:
- Investigating the results of a clinical trial by testing out different p-values.
- Creating interactive visualizations to share with others.
In addition to the code-based products above, Pipeline Builder is the Palantir platform's primary no-code application for building and maintaining production data pipelines. Pipeline Builder uses a graph and form-based interface, enabling users to integrate data and create business logic transformations without writing code. If you are evaluating whether to use Pipeline Builder or Code Repositories for your pipeline, see Considerations: Pipeline Builder and Code Repositories.
Comparison summary
| | Code Repositories | Code Workspaces | Code Workbook |
|---|---|---|---|
| Features | Advanced pipelines | Exploratory analysis | Advanced analysis |
| | Enables complex workflows in long-lasting data pipelines with flexibility in performance optimization and code generation. | Enables interactive exploratory workflows using familiar IDEs tied with Foundry primitives. | Enables data analysis workflows with support for common analytical languages and visualization libraries. |
| Languages supported | Python, SQL, Java, Mesa | Python, R | Python, R, SQL |
| Environments supported | All environments | Kubernetes environments only | All environments |
| Batch pipeline support | Yes | Yes | Yes |
| Incremental computation | Yes | No | No |
| Transform generation | Yes | No | No |
| Multi-output transforms | Yes | Yes | No |
| Filesystem access | Yes | Yes | Yes |
| Visualization support | No | Yes | Yes |
| Iteration cycle | Iterate on code logic | Iterate on data discovery and analysis | Iterate on insight generation |
| | Designed to help iterate on code logic. Runtime debugger and previews can assist in validating transform logic. Data can be analyzed in Foundry after building. | Designed to help rapidly iterate on data discovery and analysis using widely known tools that seamlessly integrate with the rest of Foundry. | Designed to help generate insights from data; all transforms run on the full input data, interactive console enables ad-hoc queries, and Spark execution model is optimized for quick iteration. |
| Full data preview | Preview data sample, with the ability to pre-filter the input sample | Full data preview | Full data preview |
| Debugger | Yes | No | No |
| Console support | In debug mode | Yes | Yes |
| Spark module management | Spark modules initiated at the job level | Spark-less environment for fast feedback loop | Spark modules kept warm for immediate interactivity, and initiated at the workbook level |
| Operations | Data pipeline management | Data exploration management | Data analysis management |
| | Supports Foundry data management libraries and publishing custom Python libraries. | Fully adjustable environment that can consume pip, CRAN, and conda libraries, including those published from Code Repositories. | Can consume custom libraries published from Code Repositories; users can save pieces of logic as code templates, enabling point-and-click analysis by other users. |
| Data Expectations | Yes | No | No |
| Publish custom libraries | Yes | No | No |
| Consume custom libraries | Yes | Yes | Yes, for some environments |
| Point-and-click code templates | No | No | Yes |
| Change management | Governance | Flexibility | Rapid changes |
| | Prioritizes change traceability and governance to ensure that critical pipelines remain secure and robust; advanced review and approval workflows and complete changelogs. | Prioritizes rapid and flexible iteration with full branching support and automatic Git versioning. | Prioritizes rapid iteration and collaboration with a lightweight branching workflow; does not require CI checks or unit testing. |
| Full Git workflow | Yes | Yes | No |
| Copy data after merge | No | No | Yes |
| Administer and remove security markings | Yes | No | No |
| Impact analysis views | Yes | No | No |
| Advanced code review workflows | Yes | No | No |
| Unit testing | Yes | No | No |
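One way to use the comparison table above is to treat it as a lookup: list the features a workflow needs and see which products support all of them. The sketch below is only an illustration of that reading. The `SUPPORTED` mapping transcribes a subset of the "Yes" cells from the table, and the `candidates` helper is a hypothetical name, not part of any Foundry library.

```python
# Features marked "Yes" in the comparison table (subset, transcribed by hand).
SUPPORTED = {
    "Code Repositories": {
        "incremental computation",
        "transform generation",
        "multi-output transforms",
        "filesystem access",
        "debugger",
        "data expectations",
        "publish custom libraries",
        "full Git workflow",
        "unit testing",
    },
    "Code Workspaces": {
        "multi-output transforms",
        "filesystem access",
        "visualization support",
        "full Git workflow",
    },
    "Code Workbook": {
        "filesystem access",
        "visualization support",
        "point-and-click code templates",
        "copy data after merge",
    },
}


def candidates(required):
    """Return the products whose supported features include every required one."""
    required = set(required)
    return [name for name, features in SUPPORTED.items() if required <= features]


needs_incremental = candidates({"incremental computation", "unit testing"})
needs_viz_and_git = candidates({"visualization support", "full Git workflow"})
needs_files_only = candidates({"filesystem access"})
no_single_product = candidates({"unit testing", "visualization support"})
```

An empty result, as in the last call, suggests splitting the work: for example, prototyping with visualizations in one product and promoting the tested pipeline to Code Repositories.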
Table summary
##### Code Repositories features

* Code Repositories features advanced pipelines and enables complex workflows in long-lasting data pipelines with flexibility in performance optimization and code generation.
* Languages supported in Code Repositories include Python, SQL, Java, and Mesa.
* Code Repositories supports [incremental computation](https://palantir.com/docs/foundry/transforms-python-spark/incremental-examples/), [transform generation](https://palantir.com/docs/foundry/transforms-python/pipelines/#automatic-registration), [multi-output transforms](https://palantir.com/docs/foundry/transforms-python/transforms/#define-transforms), and [filesystem access](https://palantir.com/docs/foundry/transforms-python/unstructured-files/).
* Code Repositories does not support visualizations.

##### Code Workspaces features

* Code Workspaces features quick and efficient exploratory workflows with embedded support for JupyterLab® and RStudio® Workbench in Foundry.
* Languages supported in Code Workspaces include Python and R.
* Code Workspaces supports [filesystem access](https://palantir.com/docs/foundry/code-workspaces/data/#non-tabular-datasets) and provides full flexibility for notebook-based analyses.
* Code Workspaces does not support distributed Spark, and is therefore better suited for data that can fit within the workspace's [compute limits](https://palantir.com/docs/foundry/code-workspaces/compute-usage/#understanding-drivers-of-foundry-compute-usage-in-code-workspaces).

##### Code Workbook features

* Code Workbook features advanced analysis workflows with support for common analytical languages and visualization libraries.
* Languages supported in Code Workbook include Python, R, and SQL.
* Code Workbook supports [filesystem access](https://palantir.com/docs/foundry/code-workbook/transforms-unstructured/) and [visualization](https://palantir.com/docs/foundry/code-workbook/transforms-visualize/).
* Code Workbook does not support incremental computation, transform generation, or multi-output transforms.

##### Code Repositories iteration cycle

* Code Repositories is designed to help iterate on code logic. Data can be analyzed in Foundry after building.
* Code Repositories supports data sample previews to validate transform logic, with the ability to pre-filter the input sample.
* Code Repositories supports [debugging at runtime](https://palantir.com/docs/foundry/code-repositories/debug-transforms/).
* In Code Repositories, Spark modules are initiated at the job level.

##### Code Workspaces iteration cycle

* Code Workspaces is designed to help explore and analyze data. Results can then be shared, published to dashboards, turned into reusable transforms, or exported to production-ready pipeline tools such as Code Repositories or Pipeline Builder.
* Code Workspaces offers the full flexibility of the JupyterLab® and RStudio® Workbench IDEs, including full code and data previews.
* Code Workspaces provides cell-by-cell iteration for instant feedback on code execution.
* In Code Workspaces, no Spark modules are required and a fully customizable kernel is available for ad-hoc adjustments of the environment.

##### Code Workbook iteration cycle

* Code Workbook is designed to help generate insights from data. All transforms run on the full input data, and the Spark execution model is optimized for quick iteration.
* Code Workbook supports full data previews.
* Code Workbook provides [console support](https://palantir.com/docs/foundry/code-workbook/workbooks-console/) for ad-hoc analysis of transforms.
* In Code Workbook, Spark modules are kept warm for immediate interactivity and initiated at the workbook level.

##### Code Repositories operations

* Code Repositories supports Foundry data management libraries and custom Python libraries.
* Code Repositories supports [data expectations](https://palantir.com/docs/foundry/transforms-python/data-expectations-getting-started/), [publishing](https://palantir.com/docs/foundry/transforms-python/share-python-libraries/) custom libraries, and consuming custom libraries.
* Code Repositories does not support point-and-click code templates.

##### Code Workspaces operations

* Code Workspaces can consume pip, CRAN, and conda libraries, including those published from Code Repositories, and environments can be modified quickly.
* Code Workspaces does not support data expectations or publishing custom libraries.
* Code Workspaces does not support point-and-click code templates.

##### Code Workbook operations

* Code Workbook can consume custom libraries published from Code Repositories, and users can save pieces of logic as code templates, enabling point-and-click analysis by other users.
* Code Workbook does not support data expectations or publishing custom libraries.
* Code Workbook can [consume custom libraries](https://palantir.com/docs/foundry/code-workbook/environment-overview/) in some Spark environments.
* Code Workbook supports [point-and-click templates](https://palantir.com/docs/foundry/code-workbook/templates-overview/).

##### Code Repositories change management

* Code Repositories prioritizes change traceability and governance to ensure that critical pipelines remain secure and robust.
* Code Repositories provides complete changelogs.
* Code Repositories provides a [full Git workflow](https://palantir.com/docs/foundry/building-pipelines/branching-release-process/), security marking administration and [removal](https://palantir.com/docs/foundry/building-pipelines/remove-inherited-markings/), [impact analysis](https://palantir.com/docs/foundry/code-repositories/analyze-impact/) views, [advanced code review](https://palantir.com/docs/foundry/code-repositories/branch-settings/#protected-branches) workflows, and [unit testing](https://palantir.com/docs/foundry/code-repositories/unit-tests/).
* Code Repositories does not support copying data after merging.

##### Code Workspaces change management

* Code Workspaces prioritizes rapid and flexible iteration with full branching support and automatic Git versioning.
* Code Workspaces is fully backed by Code Repositories and benefits from its [full Git workflow](https://palantir.com/docs/foundry/building-pipelines/branching-release-process/).
* Code Workspaces does not support copying data after merging.
* Code Workspaces stores safe checkpoints of its notebook's contents for 30 days, allowing users to safely retain and retrieve any given state, while also providing the opportunity to permanently store backups of the code in the Git repository.

##### Code Workbook change management

* Code Workbook prioritizes rapid iteration and collaboration with a lightweight branching workflow. Code Workbook does not require CI checks or unit testing.
* Code Workbook supports copying data after merging.
* Code Workbook does not provide a full Git workflow, security marking administration or removal, impact analysis views, advanced code review workflows, or unit testing.

***

JupyterLab® is a registered trademark of NumFOCUS. RStudio® is a trademark of Posit™.