Troubleshooting schedules¶
Scheduler metrics page¶
One of the best ways to begin troubleshooting a schedule issue is by looking at the scheduler metrics page. The metrics page can tell you the source of your failure, including common failure modes such as:
- The scheduled builds are failing. You will see evidence of failed builds in the Run History tab. Clicking one of these builds opens its Build Report in the Builds application, where you can read the full logs.
- The scheduled builds were ignored. The Run History tab will show Ignored in the Status column for any builds that were triggered but not built.
- The schedule was not triggered at the expected time or cadence. In this case, the Run History tab may not show a triggered build when you expected one.
The Versions tab shows past schedule versions and edits, and may be useful if your schedule suddenly begins behaving differently than expected. Check for any changes to the schedule version that align with this change, and consider reverting your schedule to a previously working state.
Scheduled builds are failing¶
You can verify if a schedule was triggered at the expected time by checking the Run History tab on the scheduler metrics page.
If the schedule was triggered, but the build subsequently failed, you can debug this build following the debugging guidance.
A schedule will also fail to build if the appropriate permissions are not set. The permissions of a schedule depend on the token mode the schedule is in. See Project Scoped Schedules for more information.
Scheduled builds were ignored¶
You can verify if a schedule was triggered at the expected time by checking the Run History tab on the scheduler metrics page. This will also usually give the reason that the schedule was ignored.
All datasets are up-to-date¶
A schedule run will be ignored if all the target datasets are up-to-date, that is, if their inputs have not been updated since the last build of each dataset. If this is the case, you will see this reason in the Run History tab. To see which datasets are affected, navigate to the schedules list in the schedule editor. You will then have the option to color the Data Lineage graph by Out-of-date, which gives an overview of which job specs are considered stale.
This behavior can be overridden in special circumstances using the Force Build option in Advanced Settings, though this is computationally wasteful outside these circumstances. If any of the target datasets are built by Phonograph syncs, transforms with API calls, or data connection syncs, they may not show up as stale and may require the Force Build option to be enabled for the schedule to run.
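The staleness rule above can be modeled in a few lines. This is a minimal sketch, not the scheduler's implementation: the `Dataset` class, its timestamp fields, the `run_decision` function, and the `"BUILD"`/`"IGNORED"` labels are all hypothetical, and it assumes staleness can be decided by comparing input update times with the last build time.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Hypothetical model: timestamps are plain numbers (for example epoch seconds).
    name: str
    last_built: float
    input_updates: list[float] = field(default_factory=list)

    def is_stale(self) -> bool:
        # Stale if any input was updated after the last build of this dataset.
        return any(t > self.last_built for t in self.input_updates)


def run_decision(targets: list[Dataset], force_build: bool = False) -> str:
    """Return "BUILD" or "IGNORED" for one schedule run (illustrative labels)."""
    if force_build:
        # Force Build skips the staleness check entirely.
        return "BUILD"
    return "BUILD" if any(d.is_stale() for d in targets) else "IGNORED"


fresh = Dataset("orders_clean", last_built=200.0, input_updates=[100.0, 150.0])
print(run_decision([fresh]))                    # IGNORED
print(run_decision([fresh], force_build=True))  # BUILD
```

In this model, a dataset whose updates arrive from an external sync has no recorded input update, so it never looks stale. That mirrors why such datasets may need Force Build.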
Schedule builds a subset of datasets¶
If the schedule only triggers a subset of datasets, you will see evidence of this in the Run History tab on the scheduler metrics page.
One cause of this is that only a subset of the datasets were stale. Scheduler builds only the stale datasets and skips those that are up-to-date. See all datasets are up-to-date for more troubleshooting details. If every dataset is up-to-date, the whole run is marked Ignored instead.
Another cause could be that the dataset was not included in the dataset graph of the build. In the schedule editor, when a given schedule is selected, the datasets to be built are highlighted in the Data Lineage graph. The dataset selection depends upon the build type. If using a Connecting build, you should especially take care to verify if a connecting dataset is present for schedules using the same datasets on multiple branches.
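The two causes described in this section can be told apart with a simple set comparison. The sketch below is illustrative only: `explain_skipped` and its three inputs are hypothetical, and you would fill them in by hand from the Run History tab and the Data Lineage graph.

```python
def explain_skipped(
    expected: set[str], in_build_graph: set[str], stale: set[str]
) -> dict[str, str]:
    """Map each dataset you expected to build to the reason it was or was not built."""
    reasons = {}
    for name in sorted(expected):
        if name not in in_build_graph:
            # The build type did not select this dataset at all.
            reasons[name] = "not in the build's dataset graph"
        elif name not in stale:
            # Selected, but its inputs have not changed since the last build.
            reasons[name] = "up-to-date, skipped"
        else:
            reasons[name] = "built"
    return reasons


report = explain_skipped(
    expected={"a", "b", "c"},
    in_build_graph={"a", "b"},
    stale={"a"},
)
for name, reason in report.items():
    print(f"{name}: {reason}")
```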
Schedule was not triggered¶
You can verify if a schedule was triggered at the expected time by checking the Run History tab on the scheduler metrics page. Some common debugging steps here are:
- Check that the schedule is not paused. Paused schedules will not trigger until unpaused.
- Check the schedule trigger configuration. If the schedule previously triggered as expected, check the schedule history to see if the trigger was changed recently.
- If the schedule is using an Event trigger, verify that the expected event actually happened. For example, if the build should be triggered when the input updates, check that the last build on the input ran successfully and that its transactions were committed. You can confirm this in the Dataset Preview History view.
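The checklist above can be written down as a small function that collects every likely cause at once. This is a sketch under stated assumptions: `ScheduleState` and its fields are hypothetical and stand for facts you read off the UI while debugging, not values returned by any real API.

```python
from dataclasses import dataclass


@dataclass
class ScheduleState:
    # Hypothetical snapshot of what you observed while debugging.
    paused: bool
    trigger_changed_recently: bool
    uses_event_trigger: bool
    upstream_build_succeeded: bool = True
    upstream_transaction_committed: bool = True


def not_triggered_causes(s: ScheduleState) -> list[str]:
    """Return the checklist items that could explain a missing trigger."""
    causes = []
    if s.paused:
        causes.append("schedule is paused")
    if s.trigger_changed_recently:
        causes.append("trigger configuration changed recently")
    event_happened = s.upstream_build_succeeded and s.upstream_transaction_committed
    if s.uses_event_trigger and not event_happened:
        causes.append("expected event did not happen")
    return causes


state = ScheduleState(
    paused=False,
    trigger_changed_recently=False,
    uses_event_trigger=True,
    upstream_transaction_committed=False,
)
print(not_triggered_causes(state))
```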
Schedule retries differ from configured¶
Be aware that not all types of failure are retryable. The number of retries when the schedule runs will be capped to a maximum configured by your administrator. See Advanced settings for more information.
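The two rules in this section combine as follows. This is a minimal sketch, assuming a non-retryable failure gets no retries and that the administrator cap is a simple upper bound; `effective_retries` and its parameters are hypothetical names.

```python
def effective_retries(configured: int, admin_cap: int, retryable: bool) -> int:
    """Number of retries a run can actually get in this simplified model."""
    if not retryable:
        # Assumption: a non-retryable failure is never retried.
        return 0
    # The configured value is capped by the administrator maximum.
    return max(0, min(configured, admin_cap))


print(effective_retries(configured=5, admin_cap=3, retryable=True))
print(effective_retries(configured=5, admin_cap=3, retryable=False))
```

With these example values, a schedule configured for five retries gets three, and none if the failure is not retryable.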
Schedule is failing with JobSpecInputsTrashed or JobSpecOutputsTrashed, or Data Lineage warns that some datasets are trashed¶
This means that the schedule contains or reads from a trashed resource. You can do one of the following to resolve:
- Restore the deleted dataset from trash.
- Exclude the deleted dataset from the schedule. If this dataset is used as an input to another downstream dataset in the schedule, you will also need to do one of the following:
  - Exclude the downstream dataset along with the trashed one.
  - Modify the logic of the downstream dataset so that it no longer takes the trashed dataset as input.
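To find which downstream datasets need attention before you exclude a trashed dataset, you can compare each dataset's inputs against the trashed set. The sketch below is illustrative: `affected_by_trash` and the `inputs` mapping are hypothetical, and it checks direct inputs only.

```python
def affected_by_trash(
    inputs: dict[str, set[str]], trashed: set[str]
) -> dict[str, set[str]]:
    """For each non-trashed dataset in the schedule, list its trashed direct inputs."""
    return {
        name: deps & trashed
        for name, deps in inputs.items()
        if name not in trashed and deps & trashed
    }


# Hypothetical schedule: dataset name -> names of its direct inputs.
schedule_inputs = {
    "raw": set(),
    "clean": {"raw"},
    "report": {"clean", "lookup"},
}
print(affected_by_trash(schedule_inputs, trashed={"lookup"}))
```

Each dataset in the result must either be excluded along with the trashed dataset or changed so that it no longer reads from it.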
Scheduler permissions¶
If you run into an issue in which you are unable to edit a schedule, project scope permissions may be the root cause.
To edit a schedule in project scoped mode, the user must have Edit permissions on the target datasets, View permissions on the trigger datasets, and Edit permissions on the project that the schedule is scoped to. If you have lost permissions on one of the datasets, remove that dataset from the schedule before you save your changes.
To edit, delete or pause a schedule, the user needs to have Edit permissions on the target datasets and Edit permissions on the project that the schedule is scoped to. To view a schedule, the user needs to have View permissions on the target datasets.
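The edit requirements for project scoped mode can be checked resource by resource. This is a sketch under stated assumptions: the `Access` levels, the `perms` mapping, and `missing_for_edit` are hypothetical, and the ordering in which Edit implies View is an assumption made for this example.

```python
from enum import IntEnum


class Access(IntEnum):
    # Assumed ordering: a higher level implies the lower ones.
    NONE = 0
    VIEW = 1
    EDIT = 2


def missing_for_edit(
    perms: dict[str, Access], targets: list[str], triggers: list[str], project: str
) -> list[str]:
    """List the permissions a user still lacks to edit a project scoped schedule."""
    required = [(t, Access.EDIT) for t in targets]
    required += [(t, Access.VIEW) for t in triggers]
    required.append((project, Access.EDIT))
    return [
        f"{resource}: needs {level.name}"
        for resource, level in required
        if perms.get(resource, Access.NONE) < level
    ]


user_perms = {"sales": Access.EDIT, "raw_sales": Access.VIEW, "proj": Access.VIEW}
print(missing_for_edit(user_perms, ["sales"], ["raw_sales"], "proj"))
```

An empty result means the user meets all three requirements. Any entry names the resource to fix, or the dataset to remove from the schedule before saving.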