Recommended support processes
Pipeline monitoring is usually best managed with on-call rotations. One team member at a time actively monitors the pipeline ("on-call"), and responding to pipeline issues (usually in the form of failing health checks or monitoring rules) is their top priority for the duration of the rotation.
The following steps are recommended for setting up an effective pipeline monitoring team:
- Define support hours for the different pipelines you monitor.
  - Is it critical that the pipeline updates correctly overnight? Or on weekends?
- Define an alerting mechanism. More on this below.
- Define a support rotation schedule.
  - How long is a support rotation and on what day should hand-off occur?
  - (If you're using an external tool such as PagerDuty) Do you need a secondary on-call rotation schedule in case the primary on-call person misses an alert?
- Prepare pipeline documentation.
  - It should be easy to find where documentation lives. A good location is close to the pipeline in Foundry, such as a top-level `documentation` folder of the Project where the key outputs of the pipeline live.
  - Documentation should include:
    - An overview of the purpose of your pipeline, how the outputs are used, and the organizational expectations and SLAs.
    - An up-to-date Data Lineage graph for your pipeline.
    - A list of all upstream support teams that the on-call team member may need to contact if there is an issue. Examples include platform support teams, Palantir pipeline support teams, and other teams at your organization that provide data for your pipeline.
    - An escalation path: how does the on-call person escalate an urgent issue that needs immediate attention and that they cannot resolve themselves?
    - A section on recurring issues in your pipeline, so that you or the next on-call person can easily identify an issue that has occurred before and apply the same fix. Some teams may choose to log all technical issues that occur, but consider that over-documenting can make it more difficult to find information quickly.
- Define an SOP for hand-off between on-call rotations.
  - Will hand-off be scheduled at a regular time?
  - How will you track long-running issues that span different team members' on-call rotations?
  - Whose responsibility will it be to track and resolve long-running issues? Does the on-call team member who first triaged the issue keep working on it when they are not on-call, or does the issue get handed off to whoever is actively on-call?
- Define a process to communicate downtime and maintenance to downstream consumers. This helps minimize the risk of downstream pipeline maintainers being alerted for non-issues.
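The rotation questions above (rotation length, hand-off day, secondary on-call) can be made concrete with a small schedule calculation. The following is a minimal sketch, not a Foundry or PagerDuty feature: the `Rotation` class, the member names, the seven-day length, and the choice of "next person in line" as secondary are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Rotation:
    members: tuple[str, ...]   # hypothetical roster, in rotation order
    start: date                # first hand-off date; its weekday is the hand-off day
    length_days: int = 7       # assumed rotation length

    def shift_index(self, day: date) -> int:
        """Number of completed rotations between `start` and `day`."""
        if day < self.start:
            raise ValueError("day is before the rotation start")
        return (day - self.start).days // self.length_days

    def primary(self, day: date) -> str:
        return self.members[self.shift_index(day) % len(self.members)]

    def secondary(self, day: date) -> str:
        # Assumption: the secondary is the person who takes over next.
        return self.members[(self.shift_index(day) + 1) % len(self.members)]

    def next_handoff(self, day: date) -> date:
        completed = self.shift_index(day) + 1
        return self.start + timedelta(days=completed * self.length_days)


# Illustrative roster; 2024-01-01 is a Monday, so hand-off happens on Mondays.
rotation = Rotation(members=("ana", "ben", "chen"), start=date(2024, 1, 1))
today = date(2024, 1, 10)
print(rotation.primary(today), rotation.secondary(today), rotation.next_handoff(today))
# prints: ben chen 2024-01-15
```

Making the secondary the next primary is one possible design choice: that person is already aware of open issues when the hand-off arrives.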
Alerting Mechanisms
An alerting mechanism notifies you when health checks fail in your pipeline, so you can respond without periodically checking a Data Lineage graph, dashboard, or report for the pipeline's status. The appropriate mechanism depends on the volume of alerts and how tight your SLAs are, since SLAs dictate how critical response time is.
The available options for automated alerting include:
- Subscribing to each individual health check in your pipeline. You will receive Foundry and email notifications if you have this notification setting turned on. However, this method may be difficult to maintain manually.
- Subscribing to a monitoring view to monitor a group of health checks and monitoring rules.
  - Notification frequency can be customized in the monitoring view's subscription settings.
- Integrating monitoring views with an external alerting tool.
  - Some tools also manage on-call rotations and schedules, including secondary on-call rotations.
  - Some tools provide further flexibility and customization of alerting settings. Customizing how a notification escalates if it is not responded to can be especially useful if you have a very critical pipeline with strict time-based SLAs.
  - Learn more about available integrations with external tools.
Regardless of which option you implement, it is useful to implement filters so that you don't miss the alerts among other Foundry platform notifications.
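To illustrate what "customizing how a notification escalates" can mean in practice, here is a minimal sketch of a time-based escalation policy. It does not reflect the configuration format of Foundry or of any specific external tool: the `EscalationStep` type, the `who_to_notify` function, the role names, and the delays are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class EscalationStep:
    notify: str        # who is alerted at this step (hypothetical role names)
    after: timedelta   # delay since the alert fired


def who_to_notify(
    steps: list[EscalationStep],
    fired_at: datetime,
    now: datetime,
    acknowledged_at: Optional[datetime] = None,
) -> list[str]:
    """Return everyone who should have been notified by `now`.

    Escalation stops at the moment the alert is acknowledged.
    """
    cutoff = now if acknowledged_at is None else min(now, acknowledged_at)
    elapsed = cutoff - fired_at
    ordered = sorted(steps, key=lambda step: step.after)
    return [step.notify for step in ordered if step.after <= elapsed]


# Illustrative policy: delays would be chosen to fit the pipeline's SLA.
policy = [
    EscalationStep("primary on-call", timedelta(minutes=0)),
    EscalationStep("secondary on-call", timedelta(minutes=15)),
    EscalationStep("team lead", timedelta(minutes=45)),
]
fired = datetime(2024, 1, 10, 2, 0)

unacknowledged = who_to_notify(policy, fired, now=datetime(2024, 1, 10, 2, 20))
acknowledged = who_to_notify(
    policy,
    fired,
    now=datetime(2024, 1, 10, 2, 20),
    acknowledged_at=datetime(2024, 1, 10, 2, 10),
)
print(unacknowledged)  # ['primary on-call', 'secondary on-call']
print(acknowledged)    # ['primary on-call']
```

The tighter the SLA, the shorter the delay before the secondary is notified should be relative to the time available to fix the issue.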