Building a production pipeline
Production workflows need reliable pipelines to back them. Following the principles laid out in this document when building a pipeline will result in easier maintenance, allowing you to catch problems before they cause SLA breaches. Some guidance here will also make it easier to share knowledge about what is important in your pipeline. This is important in all stages of the pipeline’s lifecycle, from development all the way through to long-term maintenance.
This document is useful for both pipeline developers and pipeline maintainers. For developers, it is worth reading before you start building a pipeline that will go straight into production. It can equally be used when a proof-of-concept pipeline is being converted into a production pipeline. For pipeline maintainers, the following elements should be prerequisites to entering maintenance mode.
Pipeline definition and expectations
While it may not always be possible to have definitive answers about expectations before you start building a production pipeline, it is valuable to be mindful of them as early as possible. It’s highly recommended to document the pipeline definition and expectations as you establish them.
The expectations influence several aspects of design and set-up of a production pipeline, including:
- The pipeline design and architecture decisions.
- How schedules should be set up (as this is the primary way to control how frequently the pipeline is built).
- What validations and monitoring are required.
- How to prioritize issues when the pipeline is in maintenance mode.
The important questions that your team should address include:
- What exactly is the scope of the pipeline? Where does it start and where does it end? Where should it feed into other pipelines?
  - If parts of your pipeline overlap with another pipeline or use case, consider treating the overlapping section as its own pipeline.
- What is the required pipeline refresh rate?
  - Is there a specific time of day by which data needs to be refreshed?
  - Should the pipeline run over weekends?
  - When is data considered critically out of date?
- What are the expectations in terms of refresh rate and support?
  - Be careful: while this sounds easy, this area is where pipeline maintenance teams often face the most difficulty. Without clear definitions here, it is difficult for a pipeline maintainer to prioritize work or choose the right fix.
- What is the expectation for end-to-end propagation delay? In other words, how long should it take for data to flow through the pipeline, from the moment it lands in Foundry to the point where the outputs of the pipeline are updated?
- Who is your point of contact for each external source that you pull from? External sources can be Data Connection syncs or a separate upstream Foundry pipeline that feeds into your pipeline.
- What are the functional guarantees for correctness in your data? Are there critical columns or key validations that must be true for your pipeline?
  - Note: It’s important to determine what the outcome should be if a guarantee is broken. Should a failing validation prevent data from reaching your end user, or should it fire an alert without preventing your pipeline from updating? You may prefer the latter so that other up-to-date data can still flow through your pipeline. The implementation for these two situations would differ. Documentation is a good place to start tracking guarantees early on.
- Are the expectations you have determined compatible with one another? As complex systems grow, they become more prone to failures. You generally want to allow sufficient time after an alert is fired to address the underlying issue, whether that is an unexpected pipeline failure or missing data from an upstream source.
  - Example 1: It’s important to think about the propagation delay in relation to the expected pipeline refresh rate. If the expected refresh rate is every 2 hours and the pipeline takes 1.5 hours to build, only 30 minutes remain to detect and fix a failure before the next refresh is due, so it may become difficult to adhere to SLAs.
  - Example 2: If a workflow requires an hourly refresh rate but the upstream data source only provides data twice a day, you will not be able to achieve hourly refreshes.
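The compatibility questions in the two examples above come down to simple arithmetic, which can be sketched in Python. This is only an illustration, not a Foundry API: the `PipelineExpectations` class, its field names, the `compatibility_issues` helper, and the 50% headroom rule of thumb are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class PipelineExpectations:
    """Hypothetical record of documented expectations (not a Foundry API)."""
    refresh_interval: timedelta   # how often outputs are expected to refresh
    build_duration: timedelta     # typical end-to-end build time
    upstream_interval: timedelta  # how often the upstream source delivers data


def compatibility_issues(exp: PipelineExpectations,
                         min_headroom: float = 0.5) -> List[str]:
    """Return human-readable conflicts between the documented expectations.

    min_headroom is an assumed rule of thumb: the fraction of the refresh
    interval that should stay free for reacting to alerts and rebuilding.
    """
    issues = []

    # Example 1: the build leaves too little time to fix failures.
    headroom = exp.refresh_interval - exp.build_duration
    if headroom < exp.refresh_interval * min_headroom:
        issues.append(
            f"Only {headroom} of headroom remains in each "
            f"{exp.refresh_interval} refresh window."
        )

    # Example 2: the source delivers data less often than the refresh rate.
    if exp.upstream_interval > exp.refresh_interval:
        issues.append(
            f"The upstream source delivers every {exp.upstream_interval}, "
            f"so a {exp.refresh_interval} refresh rate cannot be met."
        )

    return issues


if __name__ == "__main__":
    example_1 = PipelineExpectations(
        refresh_interval=timedelta(hours=2),
        build_duration=timedelta(hours=1, minutes=30),
        upstream_interval=timedelta(hours=2),
    )
    for issue in compatibility_issues(example_1):
        print(issue)
```

Recording expectations in a structured form like this is optional. The point is that each expectation becomes a concrete number that can be compared against the others before the pipeline enters maintenance mode.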
Principles for production pipelines
The key principle for setting up a successful production pipeline can be summarized as: “build with the idea that you won’t be around to maintain it”.
Some concrete tips that can help you achieve this:
- Version your code, and write meaningful commit messages: this makes it easier to keep track of what changes were made, when they were committed and by whom.
- In production pipelines, Java and Python are recommended if possible. There are a lot of available resources and a strong developer community for both these languages. Whilst SQL is very accessible, it can get convoluted very quickly and hence can be more difficult to debug.
- Optimize for legibility above all else: Simple pipelines are simpler to maintain. Opting for standard or existing solutions in transform code will reduce the maintenance burden and development complexity down the line. If you must do something unconventional, be sure to document it thoroughly.
- Linear pipelines where possible: In terms of general architecture, the pipelines that are easier to maintain are those that are fairly straightforward and linear, but it’s difficult to give recommendations on how to get there. Be mindful of the general structure of the pipeline, and optimize for readability and simplicity as much as possible given your constraints. Once you are thinking about moving the pipeline to production, taking some time to untangle the pipeline is usually time well spent.
- Production pipelines are often long-lived. As a result there’s a good chance that those who write the pipelines are not those who maintain or manage the pipelines in the long-term. Keep in mind that the next person who takes over the pipeline development or maintenance needs to be able to read the code and make sense of the pipeline setup.
- Keep concise documentation:
  - Logic documentation: It is generally advised that documentation about the logic itself is kept in code comments.
  - Overall pipeline documentation (including documentation of recurring issues in a pipeline): This should also be kept close to the pipeline, in an intuitive location that is easy for pipeline developers and maintainers to find. A good place for this is the Project that the pipeline belongs to.
  - Bear in mind: Over-documenting can make documentation harder to read and less useful. Be concise and document key information. If you need to capture very long and detailed pipeline information that will not be referred to as often, consider keeping it in a separate document.
Development process and infrastructure setup
Development
Once a pipeline goes into production, it’s important to ensure that the development processes for making further contributions to the pipeline are established and effectively communicated to pipeline developers. This helps prevent unexpected outages on the pipeline.
As a result, we recommend reading:
- Development Best Practices: At the very least, production pipelines should be easy to read, easy to maintain, and have locked master branches. However, to make your development setup more robust, we recommend reading through the more detailed advice in this documentation.
- Branching and Pipeline release process: if you haven’t already been working with a branching and pipeline release process during the development phase of your pipeline, we recommend this when moving your pipeline into production. The documentation will provide an example process that you can use.
Infrastructure (schedules)
On a related note, setting up schedules early will allow you to develop without having to think about manually triggering builds of different parts of your pipeline. If your scheduling is messy, you may find that this hinders your development and slows you down as changes are not propagating automatically through your pipeline.
When moving a pipeline into production, it is also recommended to review and re-structure your schedules as the schedules used during development may no longer make sense or may contain antipatterns. This step should be done before moving into maintenance mode.
Setting up or reviewing schedules according to best practices can be achieved by following the scheduling best practices documentation.
Start monitoring early
As soon as the pipeline is running regularly, you should start monitoring its behavior. This prevents tech debt from accumulating and allows you to track whether the expectations for the pipeline are realistic. It may not always be feasible to start monitoring right away, given deadlines and the capacity of your team, but it is encouraged where possible.
See the documentation on Pipeline Monitoring Best Practices for more information on how to set up monitoring.
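As a rough illustration of tracking whether refresh expectations are realistic, the sketch below classifies an output by how long ago it was last updated. It is a standalone Python example, not Foundry monitoring functionality: the `Freshness` enum, the `classify_freshness` function, and the thresholds passed to it are hypothetical, and in practice the last-updated time would come from whatever monitoring tooling you use.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Freshness(Enum):
    FRESH = "fresh"        # updated within the expected refresh interval
    LATE = "late"          # overdue, but not yet critically out of date
    CRITICAL = "critical"  # past the documented "critically out of date" limit


def classify_freshness(last_updated: datetime,
                       now: datetime,
                       expected_interval: timedelta,
                       critical_after: timedelta) -> Freshness:
    """Compare the age of an output against the documented expectations.

    expected_interval and critical_after are assumed to come from the
    expectations your team has written down for the pipeline.
    """
    age = now - last_updated
    if age >= critical_after:
        return Freshness.CRITICAL
    if age > expected_interval:
        return Freshness.LATE
    return Freshness.FRESH


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_updated = now - timedelta(hours=3)
    status = classify_freshness(
        last_updated,
        now,
        expected_interval=timedelta(hours=2),
        critical_after=timedelta(hours=6),
    )
    print(status.value)
```

Keeping a "late" state separate from "critical" reflects the earlier question of when data is considered critically out of date: a late refresh may only need a low-priority look, while a critical one should be treated as an incident.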