
Available rules

Foundry collects resource metadata that is stored internally within Foundry. Linter fetches and combines the metadata from various metadata stores. A Linter rule is logic that runs against resource metadata to determine whether a resource passes or fails the logical test. If a resource fails the test, Linter produces a recommendation as to how the resource can be modified in order to change the result of the logic.

Each Linter rule belongs to a mode.

:::callout{theme="warning"} In Resource Savings mode, each rule follows a methodology for estimating the savings if the recommendation is followed. This methodology supports prioritization; assumptions are often valid for many situations but may not be right for your use case. To validate any reduction in resource usage, use the Resource Management application. :::

Code workbook pipeline crosses schedules

Logic: Downstream datasets are built in Code Workbook but are not on the same schedule.

Why: Code workbooks can use a ready-to-use warm Spark module. When workbook transform jobs are within the same schedule, they can reuse this module while it is active and skip loading the materialized data from disk. This leads to a faster pipeline and reduced costs.

Recommendation: Move Code Workbook transforms directly downstream from this dataset into the same schedule.

Savings estimate methodology: Linter generates a rough estimate of the time taken to read in the dataset. This estimate is a heuristic based on observational tests and may not be fully accurate for your specific compute environment.

Dataset build time exceeds threshold

Logic: The percentage of time spent with jobs actively running for this resource over the last seven days has exceeded 70%. This could include many jobs with a short duration that are triggered often, or fewer jobs with longer durations.
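As a rough illustration of the logic above, the sketch below computes the share of a seven-day window during which at least one job was running. The function name, the input shape (a list of start/end pairs), and the choice to count overlapping jobs only once are assumptions made for this example; Linter's actual calculation is not documented here.

```python
from datetime import datetime, timedelta

def busy_fraction(jobs, window_end, window=timedelta(days=7)):
    """Fraction of the window during which at least one job was running.

    jobs is a list of (start, end) datetime pairs (hypothetical input shape).
    """
    window_start = window_end - window
    # Clip each job to the window and drop jobs that fall entirely outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in jobs
        if end > window_start and start < window_end
    )
    busy = timedelta(0)
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Gap before this job: close the previous merged interval.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping job: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy / window

# Two overlapping long jobs covering six of the last seven days.
example_jobs = [
    (datetime(2024, 6, 1), datetime(2024, 6, 6)),
    (datetime(2024, 6, 5), datetime(2024, 6, 7)),
]
fraction = busy_fraction(example_jobs, window_end=datetime(2024, 6, 8))
print(f"{fraction:.0%} of the window was spent building")
```

With these sample jobs the fraction is above the 70% threshold, so the resource would be flagged under this reading of the rule.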

Why: Datasets constantly building can be a large source of usage. Any frequency decrease or performance improvement on these resources can lead to large savings.

Recommendation: Consider reducing build frequency.

Savings estimate methodology: The total resource usage of the last calendar month (for example, if calculating from June 21st, the value provided is the consumption of this resource between May 1st and May 31st).
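The "last calendar month" window used in this and several later estimates can be derived from the calculation date. The helper below is a minimal sketch with a hypothetical function name; it only reproduces the date arithmetic from the example in the text.

```python
from datetime import date, timedelta

def last_calendar_month(today):
    """Return (first_day, last_day) of the calendar month before `today`."""
    first_of_this_month = today.replace(day=1)
    # The day before the first of this month is the last day of the previous month.
    last_day = first_of_this_month - timedelta(days=1)
    return last_day.replace(day=1), last_day

start, end = last_calendar_month(date(2024, 6, 21))
print(start, end)  # 2024-05-01 2024-05-31
```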

Dataset code out of date

Logic: The dataset's JobSpec commit differs from the latest commit on the master branch of the repository.

Why: If not fixed, the build of the dataset does not reflect the latest logic in the repository.

Recommendation: Remove the dataset from the schedule, or re-commit its transforms logic on the master branch.

Dataset has a filtered projection

Logic: A Contour dashboard does not have a dataset projection on one of its starting incremental datasets.

Why: Use a projection to make the dashboard quicker to populate, saving repetitive filter computations in case of steady usage.

Recommendation: Review the Contour dashboard to confirm its usage, importance, and ordering of filtered columns. Enable dataset projections in your Foundry enrollment, then add a projection on the underlying Foundry dataset that is optimized for filtering on the ordered list of flagged columns. Be sure to schedule your projection so it is regularly refreshed. Your Contour dashboard should now populate faster by leveraging the dataset projection.

Dataset should be view

Logic: The statistics of the input and output datasets of a transform indicate that there is no logic being applied.

Why: Resources are used when reading a dataset into Spark and then writing it back out. This can be avoided by skipping any computation and using a Foundry view.

Recommendation: Turn this dataset into a Foundry view to decrease pipeline runtime, data duplication, and cost.

Savings estimate methodology: The total resource usage of the last calendar month (for example, if calculating from June 21st, the value provided is the consumption of this resource between May 1st and May 31st).

Dataset should use native acceleration

Logic: Transforms associated with the datasets contain an executor-memory-based profile in their @configure block, and are not already associated with Velox.

Recommendation: It is possible that there are potential savings to be made by switching to Velox backends for these jobs. When making this change, it is important to check the resources consumed to validate any potential savings recommended by Linter.

Default branch unprotected

Logic: The default branch of the code repository is not protected.

Why: Protecting default branches is important for tracking changes, maintaining version control, and guarding against accidental changes.

Recommendation: Add the default branch to the list of protected branches in the repository settings.

Executor high memory ratio

Logic: The transform uses a memory-to-core ratio that is higher than the default of 7.5 GB per core.
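The check described above reduces to a simple division. The sketch below uses hypothetical function names and sample values; only the 7.5 GB per core default comes from the text.

```python
DEFAULT_GB_PER_CORE = 7.5  # default memory-to-core ratio stated in the rule

def memory_to_core_ratio(executor_memory_gb, executor_cores):
    """Executor memory (GB) available per executor core."""
    if executor_cores <= 0:
        raise ValueError("executor_cores must be positive")
    return executor_memory_gb / executor_cores

def exceeds_default_ratio(executor_memory_gb, executor_cores):
    """True when the ratio is strictly higher than the default."""
    return memory_to_core_ratio(executor_memory_gb, executor_cores) > DEFAULT_GB_PER_CORE

# Hypothetical executor with 30 GB of memory and 2 cores: 15 GB per core.
print(exceeds_default_ratio(30, 2))  # True
# Doubling the cores brings the same memory back to the default ratio.
print(exceeds_default_ratio(30, 4))  # False
```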

Why: There is a default memory-to-core ratio that, when exceeded, leads to increased costs. Many users increase the executor memory in response to problems that are unrelated to memory or could be solved using cheaper means. Resource savings can often be found by performing the transform using the default memory-to-core ratio. In some cases, however, (such as the explosion of rows) it is necessary to increase the memory per executor to avoid out-of-memory errors and/or excessive shuffle between executors. In such cases, increasing the number of executor cores to match the executor memory profile can lead to faster build times and significantly lower costs by increasing parallelism and reducing partition sizes.

Recommendation: We recommend reducing the memory-to-core ratio by aligning the EXECUTOR_MEMORY_X Spark profile with the EXECUTOR_CORES_X Spark profile. If this leads to memory issues, examine your Spark graph to identify the problematic stage. If the issue is in a dataset read stage, consider increasing the number of partitions of the input dataset to feed smaller partitions to each executor. If the issue is in a shuffle stage, consider disabling adaptive Spark allocation (ADAPTATIVE_DISABLED Spark profile) if there are fewer tasks than the shuffle partitions setting, or increasing the number of shuffle partitions (SHUFFLE_PARTITIONS_LARGE Spark profile) to feed smaller tasks to executors. Learn more in the Spark profiles documentation.

Savings estimate methodology: Linter estimates what the resource usage of the last calendar month would have been with the default memory-to-core ratio instead of the detected one. That value is subtracted from the observed resource usage to estimate the savings.

Incremental append dataset too many files

Logic: The dataset is incrementally updating and has a suboptimal partition size, meaning that the average partition size is too large or too small.

Why: The optimal partition size varies. If your partitions are too small, the overhead that comes with each partition can dominate the processing time. If your partitions are too large, memory management operations can increase the run time and/or lead to out-of-memory problems, which increase job duration and cause excessive resource usage in downstream datasets or analyses that read data from this dataset.

Recommendation: We recommend including logic to check partition sizes and storing the data in the dataset in better partition sizes. In the case of an APPEND dataset, check the number of files already in the output dataset. If the dataset is not backed by a Python or Java transform logic file, creating a projection can also remediate this issue. Optimizing the projection for filtering is sufficient to achieve desired compaction, but additional benefits can occur by optimizing for joins, depending on downstream query plans.
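One way to "check partition sizes" is to derive a partition count from the total data size and a target partition size. The sketch below is an illustration only: the 128 MiB target and the function names are assumptions, not values published for this rule, and the right target depends on your workload.

```python
import math

# Assumed target partition size for illustration; tune for your own data.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024

def average_partition_bytes(total_bytes, file_count):
    """Average size of the files currently backing the dataset."""
    if file_count <= 0:
        raise ValueError("file_count must be positive")
    return total_bytes / file_count

def suggested_partition_count(total_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Number of partitions that brings the average size close to the target."""
    return max(1, math.ceil(total_bytes / target_bytes))

# A 10 GiB dataset spread over 20,000 small files.
total = 10 * 1024 ** 3
print(average_partition_bytes(total, 20_000))  # roughly 0.5 MiB per file
print(suggested_partition_count(total))        # 80
```

In a PySpark transform, a count like this could be passed to `DataFrame.repartition` or `DataFrame.coalesce` before writing the output.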

Savings estimate methodology: A ratio of the time spent reading in files in recent jobs is compared to an estimate of how long a dataset of such size should usually take. This is multiplied by the number of times this happens over the last week for each downstream transformation that reads in this dataset.

Incremental over-provisioned cores

Logic: The dataset is using static allocation, a Spark setting that uses a fixed number of executors. The parallelism of the task running with the recent jobs is lower than the threshold (default: 70%).

Why: Provisioning resources statically can lead to poor parallelism; resources sit idle and await other tasks to finish before they can be used. Using the dynamic allocation Spark setting can ensure that idle executors are no longer reserved exclusively for this job, leading to better task parallelism.

Recommendation: We recommend using dynamic allocation by applying a DYNAMIC_ALLOCATION_MAX_N Spark profile to the transform.

Savings estimate methodology: The shortfall from perfect parallelism (one minus the calculated parallelism) is multiplied by the resource consumption in the last calendar month (for example, if calculating from June 21st, the value provided is the consumption of this resource between May 1st and May 31st) to estimate the savings if perfect parallelism were achieved. If a dataset has a recent parallelism of 70%, for example, then an estimated 30% of last month’s resource usage can be saved.
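The 70% example above works out as last month's usage multiplied by the idle share, that is, one minus the parallelism. A minimal sketch follows; the function name and the usage figure are hypothetical.

```python
def over_provisioned_savings(last_month_usage, parallelism):
    """Estimated savings if perfect parallelism were achieved.

    parallelism is a fraction between 0 and 1 (0.7 means 70%).
    """
    if not 0 <= parallelism <= 1:
        raise ValueError("parallelism must be between 0 and 1")
    return last_month_usage * (1 - parallelism)

# 1,000 units of usage last month at 70% parallelism: about 300 units saved.
print(round(over_provisioned_savings(1000, 0.7), 2))
```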

Incremental replace dataset too many files

Similar to Incremental append dataset too many files, but for datasets that use the replace incremental transform output write mode. We do not recommend using dataset projections as a fix in this case if UPDATE transactions are not purely adding files; modifying or deleting existing files will cause a full recompute of the dataset projection. To learn more about incremental Python transform modes, see the documentation on incremental transforms.

Overlapping schedules

Logic: The schedule builds a dataset that is also built by another schedule.

Why: The same dataset is built by two different schedules, potentially causing compute waste, queueing, and unreliable latency.

Recommendation: Change the build action of the schedule(s) to eliminate overlap.
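Conceptually, the overlap check inverts the mapping from schedules to the datasets they build. The sketch below assumes a plain dictionary of schedule names to dataset names (a hypothetical input shape; it is not how Linter reads schedule metadata).

```python
from collections import defaultdict

def find_overlaps(schedules):
    """Return {dataset: [schedules]} for datasets built by more than one schedule."""
    builders = defaultdict(set)
    for schedule_name, datasets in schedules.items():
        for dataset in datasets:
            builders[dataset].add(schedule_name)
    return {
        dataset: sorted(names)
        for dataset, names in builders.items()
        if len(names) > 1
    }

example = {
    "hourly_refresh": ["orders_clean", "orders_enriched"],
    "nightly_rebuild": ["orders_enriched", "orders_report"],
}
print(find_overlaps(example))
# {'orders_enriched': ['hourly_refresh', 'nightly_rebuild']}
```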

Repository should only allow squash and merge

Logic: In this repository, the Squash and merge git behavior is not enabled.

Recommendation: Enable Squash and merge in the repository settings.

Repository upgrade blocked

Logic: The code repository has an open repository upgrade PR that cannot be merged.

Why: Repository upgrades are essential for the pipeline to keep using the latest functionalities available. Repositories that are not current can cause pipeline failures over time, affecting workflow stability and causing compute waste.

Recommendation: Investigate why the PR is not merging, and fix the upgrade.

Repository upgrades disabled

Logic: A code repository's settings have disabled the automatic merging of upgrade pull requests.

Why: Repository upgrades are essential for pipelines to keep using the latest available functionalities. Repositories that are not current can cause pipeline failures over time, affecting workflow stability and causing compute waste.

Recommendation: Use the fix assistant to allow upgrade PRs to automatically merge. Otherwise, manually navigate to the repository's settings by selecting Settings > Repository > Upgrades and tick the Automatically merge upgrade pull requests option.

Schedule always fails

Logic: The schedule has not succeeded in the past 30 days. The datasets that failed to build in that time are listed.

Why: In each run, the schedule tries to rebuild a dataset that has not succeeded in the last 30 days, indicating a waste of resources.

Recommendation: Pause the schedule, apply fixes to the failing dataset(s), or remove the failing dataset(s) from the schedule.

Savings estimate methodology: The sum of the computation cost of the failing datasets on the branch over the last 30 days.

Schedule builds Code Workbook

Logic: The dataset has a Code Workbook JobSpec.

Why: Datasets in production and scheduled datasets should not be built in Code Workbook where possible; instead, they should be migrated to an application designed for in-production pipeline building such as Pipeline Builder or Code Repositories.

Recommendation: If a dataset needs to be scheduled, migrate the logic to Pipeline Builder or Code Repositories.

Schedule builds Contour analysis

Logic: The dataset has a Contour JobSpec.

Why: Datasets in production and scheduled datasets should not be built in Contour and should be migrated to an application designed for pipeline building, such as Pipeline Builder or Code Repositories.

Recommendation: If a dataset needs to be scheduled, migrate the logic to Pipeline Builder or Code Repositories.

Savings estimate methodology: The sum of the computation cost of the failing datasets on the branch over the last 30 days.

Schedule builds trashed dataset

Logic: The dataset is in the trash.

Why: Even though a dataset is placed into the trash, the schedule is attempting to build it.

Recommendation: Investigate why the dataset is still in the schedule and either remove it from the trash or update the schedule.

Schedule does not abort on failure

Logic: The schedule does not have abort on failure enabled.

Why: If abort on failure is not enabled, monitoring waits for the schedule to finish even when a dataset within the schedule has already failed.

Recommendation: Enable abort on failure mode. This feature allows monitoring to be notified about the failure sooner and enables faster remediation.

Schedule exceeded duration mode disabled

Logic: The schedule does not have exceeded duration mode enabled.

Why: When a duration is set, this mode prevents Data Connection builds from indefinitely running during a network flake and prevents downstream outages.

Recommendation: Enable exceeded duration mode on the schedule.

Schedule has no built datasets

Logic: The schedule does not build any datasets.

Why: The schedule potentially wastes resources and generates noise for monitoring.

Recommendation: Investigate whether the schedule is useful, and decide whether to delete or remove it from the monitoring scope.

Schedule ignores irrelevant datasets

Logic: The schedule declares ignored datasets, which do not modify the build action.

Recommendation: Remove offending datasets from the schedule’s ignore list.

Schedule inputs disconnected

Logic: The schedule specifies input datasets that are not actually inputs for any datasets built by the schedule.

Why: This issue usually occurs due to excluded datasets.

Recommendation: Either modify excluded datasets, or remove the input from the connection build action.

Schedule inputs unscheduled

Logic: The schedule has buildable, unscheduled, not Fusion-backed input datasets.

Recommendation: Schedule the flagged dataset, or remove its code and JobSpec to make it raw. If the dataset needs to have a JobSpec but should not be scheduled (such as discretionary snapshots), add an exception.

Schedule missing description

Logic: The schedule does not have a description.

Recommendation: Enter a description based on the purpose and functionality of the schedule.

Schedule missing name

Logic: The schedule does not have a name.

Recommendation: Generate a name based on the dataset names and paths in the build. This can be automatically generated.

Schedule on branch

Logic: A schedule is running on a branch that is not master.

Why: The master branch is used throughout Foundry as the canonical version of data. Often, schedules should run consistently on branches to provide alternate versions of the data or to experiment with alternative logic. This can leave many schedules that were set up on branches running longer than needed.

Recommendation: Determine whether the schedule on the branch is needed, and consider pausing or deleting the schedule if it is not required. You can also consider reducing the trigger frequency of a schedule not on the master branch to reduce cost. For example, if the schedule runs once a day on master, could the non-master branch deliver the same result if run once a week?

Savings estimate methodology: The ratio of the number of jobs that ran on this branch over the past month divided by the number of jobs that ran on master over the past month is used to renormalize the resource usage of datasets built by this schedule. Summing this usage renormalized by branch yields the savings estimate.

Schedule outputs not targets

Logic: Some resolved outputs of the schedule are not marked as targets, or vice versa.

Recommendation: Update the schedule for targets to match actual schedule outputs.

Schedule potentially unused

Logic: The lineage of all resources in the schedule is checked to determine whether any recent downstream usage occurred in scheduled pipelines, interactive analyses such as Contour, object storage destinations, or external API calls.

Why: Foundry changes over time, and it is not beneficial to spend resources calculating results that will never be used to make decisions. Turning off unused pipelines is a way to ensure your Foundry instance is only spending resources computing results that are being actively used to drive your goals.

Recommendation: Pause the unused schedule. The recommendation will sometimes reveal unused subsets of the schedule that will likely reduce compute if removed.

Savings estimate methodology: The sum of resource usage across the unused resources in the schedule over the last calendar month.

Schedule retries disabled

Logic: The schedule does not have retries enabled.

Why: Certain failures, such as ones caused by environment setup issues, usually do not appear in the second run. Not having retries in a schedule may cause datasets to not build when they could have been built.

Recommendation: Enable retries. We recommend setting three retry attempts.

Schedule scope invalid

Logic: The schedule has an invalid scope with respect to its build action. Its scope excludes datasets that should be built by this schedule.

Recommendation: Edit the schedule and modify its project scope to include all projects containing datasets to be built by this schedule. If this is not possible, remove the schedule’s project scope.

Schedule trigger is not input

Logic: The schedule declares a dataset as a trigger but not as an input.

Recommendation: Modify the connecting build action to either remove the dataset from triggers or add it to inputs.

Schedule triggers itself

Logic: The schedule declares a dataset as a trigger, which is also being built.

Recommendation: Modify the build action to remove the problematic dataset from triggers or output datasets.
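The condition behind this rule is a set intersection between a schedule's triggers and the datasets it builds. The sketch below uses hypothetical names and plain lists of dataset names as input.

```python
def self_triggering_datasets(triggers, outputs):
    """Datasets that are both a trigger of the schedule and built by it."""
    return sorted(set(triggers) & set(outputs))

triggers = ["raw_events", "events_clean"]
outputs = ["events_clean", "events_daily"]
print(self_triggering_datasets(triggers, outputs))  # ['events_clean']
```

Any dataset returned here would need to be removed from either the triggers or the outputs to break the loop.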

Schedule triggers too often

Logic: Schedule has multiple trigger conditions.

Why: Schedules are often created to run whenever new data is available. For example, if your schedule has two input datasets that each update once an hour, it can lead to a schedule running twice an hour. A significant cost reduction can occur if the schedule instead runs only once all of its input datasets have updated.

Recommendation: Consider changing the schedule to use AND triggers rather than OR, or switching to a time-based trigger to run once an hour.
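To see why AND triggers reduce the run count, the toy simulation below counts schedule runs over one day for two inputs that each update hourly. It is a simplified model with hypothetical names: it ignores job duration, queueing, and any other trigger semantics that Foundry may apply.

```python
def count_runs(events, inputs, mode):
    """Count runs for a list of (minute, input_name) update events.

    mode "OR": every input update starts a run.
    mode "AND": a run starts only once every input has updated since the last run.
    """
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    required = set(inputs)
    updated = set()
    runs = 0
    for _, name in sorted(events):
        updated.add(name)
        if mode == "OR" or updated == required:
            runs += 1
            updated.clear()
    return runs

# Input "a" updates at minute 5 of each hour, input "b" at minute 35.
events = []
for hour in range(24):
    events.append((hour * 60 + 5, "a"))
    events.append((hour * 60 + 35, "b"))

print(count_runs(events, ["a", "b"], "OR"))   # 48 runs per day
print(count_runs(events, ["a", "b"], "AND"))  # 24 runs per day
```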

Schedule unnecessary force builds

Logic: The schedule is using the force build setting even though it is not a Data Connection ingest schedule.

Why: Foundry tracks transactional changes throughout the input lineage with a system that allows skipping computation if inputs are unchanged. The force build setting ensures that a transform runs regardless of the result of the input transaction analysis, leading to potentially unnecessary jobs that repeatedly produce the same result.

Recommendation: Check to see if there are any untracked external dependencies in the transform (API calls, for example) that would require using this setting. If no dependencies are found, remove the setting from the schedule.

Savings estimate methodology: The total resource usage of the datasets in the schedule over the last calendar month (for example, if calculating from June 21st, the value provided is the consumption of this resource between May 1st and May 31st) is multiplied by their job input change percentage. A dataset’s job input change percentage is the number of jobs with updated inputs divided by the total number of jobs in the last week (not the previous month over which the consumption data was taken).

Snapshot dataset too many files

Similar to Incremental append dataset too many files, but for datasets that are added using SNAPSHOT transactions. We do not recommend using dataset projections in this case.

Snapshot over-provisioned cores

Similar to Incremental over-provisioned cores, but for datasets that are run with SNAPSHOT transactions.

Spark plan potentially non-deterministic

Logic: A query plan with potentially non-deterministic logic was detected for a dataset built by Spark.

Why: There are several edge cases in Spark query plans that may lead to inconsistent, non-deterministic, or otherwise unstable data.

Recommendation: Each recommendation should provide more information on the nature of the potential non-determinism. Investigate and fix the logic backing the dataset that has a potentially non-deterministic Spark plan.

Transform should be lightweight

Logic: Transforms associated with these datasets do not use Spark context or Spark dataframes, or are already associated with Pandas-based transforms.

Recommendation: Switching these transforms to use lightweight transforms will result in potential savings. When making this change, it is important to check the resources consumed to validate any potential savings recommended by Linter.

Transform should build incrementally

Logic: Dataset builds as a SNAPSHOT but is directly downstream of at least one incremental dataset that purely appends new data.

Why: A SNAPSHOT transaction processes all historic data. An incremental transaction processes only the new data and adds the results onto the previous values. Switching to incremental transactions can lead to a dramatic reduction in the resources required to deliver an outcome.

Recommendation: Consider converting this dataset to an incremental transform to save compute cost. If the dataset is configured to run incrementally, understand and address why the dataset frequently builds as a SNAPSHOT.

Savings estimate methodology: The total resource usage of the last calendar month (for example, if calculating from June 21st, the value provided is the consumption of this resource between May 1st and May 31st) is multiplied by the ratio of the sum of input sizes (incremental sizes if inputs are incremental) to the sum of total input sizes (regardless of input incrementality). This returns the estimated resource usage of this dataset if it were to run incrementally. Therefore, estimated savings of this recommendation are the difference between past month resource usage and this resource usage estimate.
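The methodology above can be sketched as follows. The function name and input shape are assumptions for illustration: each input is a pair of its total size and, if it is incremental, the size of the newly added data (otherwise `None`).

```python
def incremental_savings(last_month_usage, inputs):
    """Estimated savings from running incrementally.

    inputs is a list of (total_size, incremental_size_or_None) pairs.
    """
    total_size = sum(total for total, _ in inputs)
    if total_size == 0:
        return 0.0
    # Incremental inputs contribute only their new data; others are read in full.
    processed_size = sum(
        total if incremental is None else incremental
        for total, incremental in inputs
    )
    estimated_incremental_usage = last_month_usage * processed_size / total_size
    return last_month_usage - estimated_incremental_usage

# One incremental input (900 units total, 90 new) and one snapshot input (100 units).
print(incremental_savings(1000, [(900, 90), (100, None)]))  # 810.0
```

In this example the transform would read 190 of 1,000 size units, so the estimated incremental usage is 190 and the estimated savings are 810.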

Transform should build locally

Logic: The transform uses dataset(s) that are small in size (300 MB or less).

Why: By default, Foundry uses distributed computation for ease of scaling and speed. This uses more resources than local computation. In cases where the dataset sizes are small and will remain so, local execution should be used (not distributed) so that fewer resources are needed to perform the transform.

Recommendation: Enable local execution for this transform or all transforms in the repository that have input and output datasets smaller than the threshold size. This can be enabled by using a specific setting, such as the KUBERNETES_NO_EXECUTORS Spark profile.

Savings estimate methodology: The total resource usage of the last calendar month (for example, if calculating from June 21st, the value provided is the consumption of this resource between May 1st and May 31st) is multiplied by the fractional reduction in cores. For example, if a transform was previously using five cores (two executors with two cores per executor, and one driver with one driver core) and running locally can use one core, then the estimated savings is the total resource usage multiplied by 0.8 ((5-1)/5).
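The worked example above (five cores reduced to one) can be sketched as a small calculation. The function and parameter names are hypothetical, and the usage figure is made up for illustration.

```python
def local_execution_savings(last_month_usage, executors, cores_per_executor,
                            driver_cores=1, local_cores=1):
    """Estimated savings from moving a distributed transform to local execution."""
    current_cores = executors * cores_per_executor + driver_cores
    if current_cores <= 0:
        raise ValueError("the current configuration must use at least one core")
    # Fraction of cores no longer needed, e.g. (5 - 1) / 5 = 0.8.
    reduction = (current_cores - local_cores) / current_cores
    return last_month_usage * reduction

# Two executors with two cores each plus a one-core driver, reduced to one core.
print(local_execution_savings(1000, executors=2, cores_per_executor=2))  # 800.0
```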

User-scoped schedule

Logic: The schedule is not project-scoped.

Recommendation: Modify the schedule to run on scoped token mode; by default, the scope is the project RIDs of the built datasets.

