Building pipelines

The first steps in creating a data pipeline are to connect your organization's data sources to Foundry and get data flowing through the system. At this stage, focus on validating that the data is high quality and can serve as a reliable foundation for use case development, model development, and analytics.

This section of the documentation focuses on the initial stages of creating a pipeline, when business requirements may still be in flux and pipeline logic changes frequently. In this phase, the goal is to lay a solid foundation, both to support target use cases and to keep the pipeline maintainable in the future.

:::callout{theme="success" title="Palantir Learning portal"}
You can find a deep dive course on building your first pipeline at learn.palantir.com.
:::

Initial steps

In most cases, these are the initial steps you should follow in pipeline development:

1. Set up the recommended Project structure so that data security and governance are organized from the very beginning of your development process.
2. Create a batch pipeline in Pipeline Builder or Code Repositories to process input datasets, perform data cleaning and filtering, and join with other datasets to create high-quality datasets that can feed into the Ontology to support workflow development.
3. Map your final datasets into object types and link types in the Ontology.
4. Set up schedules so that data begins to flow through regularly.
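The clean, filter, and join work described in the batch pipeline step can be sketched in plain Python. This is only an illustration of the logic, not Foundry's transforms API or Pipeline Builder output; the dataset names, fields, and helper functions (`clean_orders`, `join_customers`, `run_batch`) are all hypothetical.

```python
# Hypothetical sample inputs; names and fields are illustrative only.
raw_orders = [
    {"order_id": "1", "customer_id": "c1", "amount": " 10.50 "},
    {"order_id": "2", "customer_id": "c2", "amount": "n/a"},  # unparseable amount
    {"order_id": "3", "customer_id": None, "amount": "7"},    # missing join key
    {"order_id": "4", "customer_id": "c1", "amount": "3.25"},
]
customers = [
    {"customer_id": "c1", "region": "EMEA"},
    {"customer_id": "c2", "region": "APAC"},
]


def clean_orders(rows):
    """Clean and filter: drop rows with a missing key or an unparseable amount."""
    cleaned = []
    for row in rows:
        if row["customer_id"] is None:
            continue
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue
        cleaned.append(
            {
                "order_id": row["order_id"],
                "customer_id": row["customer_id"],
                "amount": amount,
            }
        )
    return cleaned


def join_customers(orders, customer_rows):
    """Inner join of orders to customers on customer_id."""
    region_by_customer = {c["customer_id"]: c["region"] for c in customer_rows}
    return [
        {**order, "region": region_by_customer[order["customer_id"]]}
        for order in orders
        if order["customer_id"] in region_by_customer
    ]


def run_batch(order_rows, customer_rows):
    """A batch run recomputes the full output from the full inputs."""
    return join_customers(clean_orders(order_rows), customer_rows)


final_orders = run_batch(raw_orders, customers)
```

In this sketch, `final_orders` plays the role of the final dataset that would then be mapped into the Ontology. Because `run_batch` recomputes everything on each run, it stays simple to reason about while requirements are still changing.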

Beyond these steps, you can make your pipeline more robust and scalable by adding unit tests, setting up a branching and release process, and defining health checks. Learn about best practices for pipeline development.
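To make the idea of a health check concrete, here is a minimal sketch in plain Python. It does not use Foundry's health check feature. The function name `check_dataset_health` and the specific checks (minimum row count, non-null and unique primary key) are assumptions chosen for illustration.

```python
def check_dataset_health(rows, primary_key, min_rows=1):
    """Return a list of human-readable failures; an empty list means healthy.

    Hypothetical checks: minimum row count, no null keys, no duplicate keys.
    """
    failures = []
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, found {len(rows)}")

    keys = [row.get(primary_key) for row in rows]
    if any(key is None for key in keys):
        failures.append(f"null values in primary key '{primary_key}'")

    non_null_keys = [key for key in keys if key is not None]
    if len(set(non_null_keys)) != len(non_null_keys):
        failures.append(f"duplicate values in primary key '{primary_key}'")

    return failures


healthy_failures = check_dataset_health([{"id": 1}, {"id": 2}], "id")
unhealthy_failures = check_dataset_health([{"id": 1}, {"id": 1}, {"id": None}], "id")
```

The same function can double as a unit test assertion: a test builds a small input, runs the pipeline logic, and asserts that the returned list of failures is empty.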

Incremental pipelines

If the volume of changes to the input data flowing into your pipeline is large, it may be best to create an incremental pipeline that processes only the changed data. In most cases, you can begin with a batch pipeline and put an incremental pipeline into place afterwards to improve performance and reduce latency.

In some cases, it is preferable to design your pipeline to be incremental from the start, especially when you know that the volume of new data flowing into your pipeline will be very high. However, incremental pipelines are considerably more complex to write and maintain than batch pipelines. Learn more about the different types of pipelines in Foundry.
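The difference between the two modes can be sketched in a few lines of plain Python. This is a conceptual illustration under simplifying assumptions (append-only input and purely row-level logic), not Foundry's incremental transforms API. The function names and sample rows are hypothetical.

```python
def transform(row):
    """Row-level logic shared by both modes (hypothetical)."""
    return {"id": row["id"], "value_doubled": row["value"] * 2}


def run_batch(all_input_rows):
    """Batch: recompute the whole output from the whole input on every run."""
    return [transform(row) for row in all_input_rows]


def run_incremental(previous_output, new_input_rows):
    """Incremental: transform only rows added since the last run, then append."""
    return previous_output + [transform(row) for row in new_input_rows]


# Two hypothetical deliveries of input data.
day1 = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
day2 = [{"id": 3, "value": 30}]

# The incremental runs touch 2 rows, then 1 row; a batch run on day 2 touches all 3.
incremental_output = run_incremental([], day1)
incremental_output = run_incremental(incremental_output, day2)
batch_output = run_batch(day1 + day2)
```

Here both modes produce the same output, but the incremental run does work proportional to the new rows only. The equivalence holds only because the input is append-only and each output row depends on a single input row. Aggregations, joins, and updated or deleted input rows break that assumption, which is where the extra complexity of incremental pipelines comes from.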

Streaming pipelines

If your use case requires very low data latency, it may be best to create a streaming pipeline. Because a streaming pipeline is only as fast as its slowest component, design the pipeline from the start to meet its target latency and throughput. Review our comparison of streaming versus batch processes for a more nuanced analysis.
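The "slowest component" point can be made concrete with a back-of-the-envelope model. The sketch below assumes a simple serial pipeline in which sustained throughput is the minimum of the stage throughputs and end-to-end latency is the sum of the stage latencies. It ignores queuing and batching effects, and the stage names and figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    max_records_per_sec: float
    latency_ms: float


def bottleneck(stages):
    """The stage with the lowest throughput limits the whole pipeline."""
    return min(stages, key=lambda stage: stage.max_records_per_sec)


def sustained_throughput(stages):
    """Records per second the serial pipeline can sustain."""
    return bottleneck(stages).max_records_per_sec


def end_to_end_latency_ms(stages):
    """Simplified model: latencies of serial stages add up."""
    return sum(stage.latency_ms for stage in stages)


def meets_targets(stages, target_records_per_sec, target_latency_ms):
    """Check a design against its throughput and latency targets."""
    return (
        sustained_throughput(stages) >= target_records_per_sec
        and end_to_end_latency_ms(stages) <= target_latency_ms
    )


# Hypothetical stages and capacities.
pipeline = [
    Stage("ingest", max_records_per_sec=5000, latency_ms=20),
    Stage("transform", max_records_per_sec=1200, latency_ms=150),
    Stage("write", max_records_per_sec=3000, latency_ms=400),
]

slowest = bottleneck(pipeline)
ok_at_1000 = meets_targets(pipeline, target_records_per_sec=1000, target_latency_ms=1000)
ok_at_2000 = meets_targets(pipeline, target_records_per_sec=2000, target_latency_ms=1000)
```

In this example the `transform` stage caps throughput at 1,200 records per second, so the design meets a 1,000 records per second target but not a 2,000 one, no matter how fast the other stages are. Running this kind of estimate before building helps identify which component needs attention first.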
