Foundry usage optimization¶
Foundry usage optimization best practices¶
This guide provides methods and best practices for optimizing Foundry usage. It first covers how usage in Foundry is determined, then how to identify wasted usage and optimize pipelines. The general recommendations may also interest project managers and platform administrators, since they focus on monitoring and optimizing an organization’s usage consumption.
In addition to the best practices listed here, Linter checks the state of Foundry for anti-patterns and provides opinionated recommendations to improve the state of resources. You can evaluate and act on these recommendations to reduce cost, optimize your Ontology, and increase pipeline stability and resilience.
When to optimize¶
As you consider how to apply these best practices to your workflows, avoid the well-known pitfall of premature optimization, and do not expect a one-size-fits-all optimization strategy.
We advise working through the questions below to check the validity of your approach:
- Is the workflow finished and working? If not, be mindful of premature optimization which could affect the functionality of the workflow.
- Is there a clear objective for this optimization, such as time-to-visualization or cost, which will be used as a success metric to drive further decisions?
- Is there already a defined bottleneck against the objective mentioned above?
If some of these questions are still to be answered, it might indicate that more pre-work is necessary in order to have a successful optimization effort.
With this in mind, below are good and bad examples of optimization efforts:
- [GOOD] A working pipeline is costing $X and an engineer is asked whether this can be reduced. The Resource Management application shows that most of the cost comes from compute resources. Inspecting the build frequency, the engineer finds that the schedule runs every day although no one uses the output every day. Changing the schedule to run less frequently would reduce consumption by ~28%. After that, the pipeline would no longer be the most expensive on the platform, and the engineer can focus on the next bottleneck.
- [BAD] An engineer is asked to design an advanced strategy to optimize the storage cost of a pipeline, which requires significant engineering hours. Once the strategy is implemented, storage costs decrease but the overall infrastructure bill stays the same. Why? Storage was incorrectly identified as the bottleneck for the cost-saving objective. Although the dataset's storage size was unnecessarily high, the low cost of storage relative to compute meant the change had only a minor impact on the total usage cost of the workflow.
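The difference between the two examples comes down to how large each component is relative to the total. The sketch below makes that arithmetic explicit. The cost figures and the helper names `share_of_total` and `total_saving` are hypothetical illustrations, not values or functions taken from Foundry.

```python
def share_of_total(costs: dict[str, float]) -> dict[str, float]:
    """Return each component's fraction of the total usage cost."""
    total = sum(costs.values())
    return {name: value / total for name, value in costs.items()}


def total_saving(costs: dict[str, float], component: str, reduction: float) -> float:
    """Fraction of the total bill saved by cutting one component by `reduction`."""
    return share_of_total(costs)[component] * reduction


# Hypothetical monthly usage figures in arbitrary units (not real Foundry data).
costs = {"compute": 900.0, "ontology": 60.0, "storage": 40.0}

# Halving a small component barely moves the total bill,
# while a 28% cut to the dominant component moves it far more.
storage_saving = total_saving(costs, "storage", 0.50)
compute_saving = total_saving(costs, "compute", 0.28)

print(f"storage -50%: total bill -{storage_saving:.1%}")
print(f"compute -28%: total bill -{compute_saving:.1%}")
```

With these assumed numbers, halving storage saves 2% of the total, while the schedule change saves about 25%. Running this kind of check before starting an optimization effort helps confirm that the chosen bottleneck matches the objective.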
General best practices¶
Some general practices for optimizing your Foundry usage include:
- Setting up projects to be able to track usage in the Resource Management application
- Leveraging resource queues
- Using incremental pipelines wherever possible
- Managing schedules to ensure they are running to meet and not exceed your Organization's requirements
- Optimizing Spark usage (depending on your level of comfort with the implementation)
Step 1: Understand the components of Foundry usage¶
Foundry usage is made up of three components: Foundry compute, Ontology volume, and Foundry storage.
The majority of accounts are on this three-dimension model; however, usage criteria may vary for some accounts. Review terms with your Palantir representative to confirm.
1. Foundry compute¶
Foundry compute is driven by tools for data integration and analysis. There are three main types of Foundry compute: batch, interactive, and streaming.
Batch compute represents queries or jobs that run in a "batch" capacity, meaning they are triggered to run in the background on a certain scheduled cadence or ad-hoc basis. Batch compute jobs do not consume any compute when they are not being run. A few examples of batch compute include all transform jobs, builds of datasets from Contour, code workbooks, data health checks, and syncs to Ontology/indexed storage.
Interactive compute represents queries that are evaluated in real time, usually as part of an interactive user session. To provide fast responses to users, interactive compute systems maintain always-on idle compute, which means interactive queries tend to take up more compute than batch evaluation. The main form of interactive compute is Contour: Contour dashboards, analyses, and embedded Contour charts are all examples of interactive compute.
Streaming compute represents always-on processing jobs that continuously receive messages and process them using arbitrary logic. Streaming compute is measured for the length of time that the stream is ready to receive messages; streaming compute has the highest cost compared to batch and interactive compute. Examples of streaming compute include streaming transformations and Pipeline Builder streams.
The amount of compute usage for batch, interactive, and streaming is driven by the following factors:
- Data logic: The logic applied to data is one of the biggest factors that impacts Foundry usage and the factor that users can influence the most.
- Transformation speed: Transformation speed is achieved by parallelizing your jobs. Foundry can scale up to thousands of simultaneous machines to quickly tackle massive computation on large datasets. However, parallelizing jobs to get faster results can introduce overhead and inefficiencies, which might lead to more usage.
- Type of computation: Different computation types require different amounts of compute to execute on the same data. For example, batch processing tends to take less compute than stream data processing because batch processing only uses compute during the runtime while streaming is always-on.
- Data scale and type: The more data there is, the more compute it takes to process it.
- Data freshness: The more often you compute new results and the more scheduled transformations, the more compute it takes to execute.
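To make the "type of computation" and "data freshness" factors concrete, the sketch below compares the hours during which compute is held by a batch job and by an always-on stream. It is a simplified illustration with hypothetical numbers: it counts only running time and ignores how many resources are used per hour, so it is not how Foundry meters usage.

```python
HOURS_PER_DAY = 24


def batch_compute_hours(runs_per_day: int, hours_per_run: float, days: int) -> float:
    """Batch jobs hold compute only while a run is executing."""
    return runs_per_day * hours_per_run * days


def streaming_compute_hours(days: int) -> float:
    """A stream holds compute for as long as it is ready to receive messages."""
    return HOURS_PER_DAY * days


# Hypothetical workload: four half-hour batch runs per day over 30 days.
batch = batch_compute_hours(runs_per_day=4, hours_per_run=0.5, days=30)
# The same 30 days handled by an always-on stream.
stream = streaming_compute_hours(days=30)
# Doubling the refresh frequency doubles the batch running time.
fresher_batch = batch_compute_hours(runs_per_day=8, hours_per_run=0.5, days=30)

print(batch, stream, fresher_batch)  # 60.0 720 120.0
```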
2. Ontology volume¶
The second component of Foundry usage is Ontology volume. One of Foundry’s unique capabilities is the Ontology layer. An Ontology is a translation layer between enterprise data and the objects your Organization cares about. An Ontology is a categorization of your data world that allows an Organization to think of their data in more tangible terms such as an “aircraft” or “car” rather than aggregations of the many rows and columns that describe them. If you are not familiar with an Ontology, you can learn more from the documentation.
Ontology volume is driven by the following factors:
- Number of Objects: Foundry’s Ontology layer can scale up to billions of Objects per Object type. The total Ontology volume of an Object type is directly related to the total number of Objects.
- Object sizes: Ontology Objects can each have hundreds of properties of arbitrary type. Objects with more or larger properties (for example, long text or images) use more Ontology volume.
- Number of Object-to-Object links: Objects with many links to other Objects can use more Ontology volume due to the size of their link metadata.
3. Foundry storage¶
Foundry storage measures the general purpose data stored in the non-Ontology transformation layers in Foundry, sometimes referred to as "cold storage".
Dataset branches and previous transactions (and views) impact how much disk space a single dataset consumes. Foundry comes with a variety of retention rules to help you keep your Foundry instance lean. When files are removed from a view with a DELETE transaction, the files are not removed from the underlying storage and thus continue to accrue storage costs. The only way to reduce size is to use Retention to clean up unnecessary transactions. Committing a DELETE transaction or updating branches does not reduce the storage used.
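The distinction between what a view shows and what storage holds can be sketched with a toy model. The `ToyDataset` class below is purely hypothetical: it is not a Foundry API, it models a single branch, and its `apply_retention` method is a stand-in for whatever retention rules are configured.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Toy model: `view` lists visible files, `storage` holds every file written."""

    view: set[str] = field(default_factory=set)
    storage: dict[str, int] = field(default_factory=dict)  # file name -> bytes

    def append(self, files: dict[str, int]) -> None:
        # Writing files adds them to both the underlying storage and the view.
        self.storage.update(files)
        self.view.update(files)

    def delete(self, names: set[str]) -> None:
        # A DELETE transaction only hides files from the view.
        self.view.difference_update(names)

    def apply_retention(self) -> None:
        # Hypothetical retention step: drop stored files the view no longer references.
        self.storage = {n: s for n, s in self.storage.items() if n in self.view}

    def stored_bytes(self) -> int:
        return sum(self.storage.values())


ds = ToyDataset()
ds.append({"a.parquet": 100, "b.parquet": 300})
ds.delete({"b.parquet"})
after_delete = ds.stored_bytes()      # still 400: the file is hidden, not removed
ds.apply_retention()
after_retention = ds.stored_bytes()   # 100: storage shrinks only after retention

print(after_delete, after_retention)
```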
Having a clear understanding of what makes up Foundry usage and what impacts it can provide you insight into optimization opportunities.
| Foundry application | Usage impact type |
|---|---|
| Code Repositories | Foundry compute |
| Pipeline Builder | Foundry compute |
| Code Workbooks | Foundry compute |
| Contour | Foundry compute |
| Live models | Foundry compute |
| Ontology | Ontology volume |
| Dataset | Foundry storage |
Step 2: Understanding how to track Foundry usage¶
The Resource Management application provides visibility and transparency for an Organization to understand their Foundry usage consumption. The application enables users to see Foundry usage consumption broken out by each Foundry usage type (Foundry compute, Ontology volume, and Foundry storage). A user can look at usage by resource (Project), source (application), and user.
When trying to identify where Foundry usage can be optimized, the first place to check is the Resource Management application. This allows you to see which resources are taking up the most compute and identify where you have bottlenecks. From here, you can leverage Foundry usage optimization best practices to identify ways to potentially reduce usage, but always remember to focus on the bottlenecks.
Step 3: Set up Projects to be able to track Foundry usage¶
Projects and the Resource Management application¶
As mentioned above, compute resources are managed at the Project-level by default within Resource Management (RMA); within RMA, we see Foundry compute, Ontology volume, and Foundry storage metrics measured per Project. Therefore, proper Project set-up is absolutely crucial in order to effectively track usage metrics across a data pipeline. A proper set-up will enable data engineers or platform administrators to monitor these usage metrics at key phases of the pipeline to identify areas to optimize. An improper set-up will result in a failure to identify resource-heavy and computationally expensive pieces of a data pipeline.
Recommended Project structure for tracking Foundry usage¶
Foundry projects should be used to enable a properly structured data pipeline. The best practices for project set-up along with pipeline stages are covered in-depth in the recommended Project and team structure documentation. Ensuring that projects follow the recommended structure, from importing the raw data from datasources to an actual workflow, will enable users to analyze compute and storage metrics along key phases of a pipeline.
Permissions¶
When looking for usage reduction strategies, administrators should consider who on their team should have access to create Projects and resources. Restricting this access to the smallest possible number of individuals who are educated on set-up best practices will limit the spread of Projects and resources, cutting down on unnecessary storage and compute. Allowing any user to create Projects on the platform will likely result in Projects that do not follow the recommended structure for tracking usage, leading to unnecessary and expensive pipelines that ultimately drive up usage. Organizations may manage their Project creation access differently based on the number of users on the platform and data access restrictions; developing a process for determining this access and educating those with access on proper Project structure is recommended to enable proper usage monitoring for these resources.
Step 4: Resource queues¶
A key feature within the Resource Management application that enables you to control your spending in Foundry is resource queues. To constrain the amount of compute power associated with one or more Projects, you can bundle Projects into queues. Each queue is assigned a resource limit that defines the maximum number of vCPUs used at once. For example, you can assign XXX vCPUs to a given queue, which will be the maximum number of vCPUs running at any given time for the assigned Projects. This gives you visibility into, and awareness of, the amount of usage each Project will consume.
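The effect of a queue limit can be illustrated with a small admission sketch. Everything here is an assumption made for illustration: the function `admit_jobs`, the strict first-in-first-out policy, the limit of 16 vCPUs, and the job names are hypothetical and do not describe how Foundry schedules jobs internally.

```python
def admit_jobs(vcpu_limit: int, requests: list[tuple[str, int]]) -> tuple[list[str], list[str]]:
    """Start jobs in arrival order until the next one would exceed the vCPU cap.

    Assumes strict FIFO: once one job has to wait, later jobs wait behind it.
    """
    running: list[str] = []
    waiting: list[str] = []
    used = 0
    for name, vcpus in requests:
        if not waiting and used + vcpus <= vcpu_limit:
            running.append(name)
            used += vcpus
        else:
            waiting.append(name)
    return running, waiting


# Hypothetical jobs from the Projects assigned to one queue: (job name, vCPUs requested).
requests = [("clean", 8), ("join", 6), ("aggregate", 4), ("export", 2)]
running, waiting = admit_jobs(vcpu_limit=16, requests=requests)

print(running)  # ['clean', 'join']
print(waiting)  # ['aggregate', 'export']
```

Under this assumed policy, the cap bounds how much compute runs at once, and the remaining work waits instead of adding to concurrent usage.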
Step 5: Incremental pipelines¶
Incremental pipelines are often used to process input datasets that change significantly over time. By avoiding unnecessary computing on all the rows or files of data that have not changed, incremental pipelines enable lower end-to-end latency while minimizing compute costs. The way to execute this is by understanding the difference between a SNAPSHOT and APPEND transaction.
Snapshot transaction¶
The default transaction type is SNAPSHOT. Snapshot transactions are treated as a replacement of all data in the dataset. That means when you open a dataset where the latest transaction type is SNAPSHOT, the preview will contain only data received in that latest snapshot transaction. The same happens when you read that dataset in a data transform or Contour analysis: you will only see data from that latest transaction.
Snapshot is the default transaction type because it’s the easiest one to use - each time your sync runs, it will download all data returned from the database query, and create a snapshot transaction that effectively replaces all data that was in the dataset before. Files present in a previous transaction are of course still available in the historical versions of the dataset, but the preview and the downstream transformation using the data will now access the new transaction by default.
Of course, snapshot transactions are simple to use correctly, but copying all data every time can be very inefficient. One potential efficiency improvement is to use the APPEND transaction type instead.
Append transaction¶
When a dataset consists of append transactions, its default view is a sum of all transactions. This means you do not have to sync files you already uploaded when you use the APPEND transaction type - only the new data is synced into Foundry. This results in a reduction of Foundry storage because each transaction only contains the added files, and NOT a snapshot of everything available in the source system.
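The way the two transaction types determine the current view can be sketched as follows. This is a toy model, not a Foundry API: the function `current_view`, the tuple representation of a transaction, and the file names are assumptions made for illustration.

```python
def current_view(transactions: list[tuple[str, list[str]]]) -> list[str]:
    """Resolve visible files: a SNAPSHOT resets the view, an APPEND adds to it."""
    view: list[str] = []
    for kind, files in transactions:
        if kind == "SNAPSHOT":
            view = list(files)
        elif kind == "APPEND":
            view.extend(files)
        else:
            raise ValueError(f"unsupported transaction type: {kind}")
    return view


def files_written(transactions: list[tuple[str, list[str]]]) -> int:
    """Count every file synced across the whole transaction history."""
    return sum(len(files) for _, files in transactions)


# Three daily syncs expressed as full snapshots: each run re-syncs everything.
snapshot_history = [
    ("SNAPSHOT", ["day1"]),
    ("SNAPSHOT", ["day1", "day2"]),
    ("SNAPSHOT", ["day1", "day2", "day3"]),
]
# The same three syncs expressed incrementally: each run adds only new data.
append_history = [
    ("SNAPSHOT", ["day1"]),
    ("APPEND", ["day2"]),
    ("APPEND", ["day3"]),
]

print(current_view(snapshot_history) == current_view(append_history))  # True
print(files_written(snapshot_history), files_written(append_history))  # 6 3
```

Both histories resolve to the same view, but the append history writes half as many files in this example, and the gap widens with every additional sync.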
Step 6: Manage schedules¶
Another way to optimize Foundry usage is via schedules. Schedules, configured in the Scheduler tool, are used to run builds on a recurring basis to keep data flowing through Foundry consistently.
Schedules should be set up to meet your Organization's requirements, but to optimize Foundry usage it is imperative that schedules are set up efficiently and do not run more than necessary. For example, if you set up a schedule for a dataset to refresh at 8 AM every day but do not actually need updated data at 8 AM every day, your Organization is wasting Foundry usage. Instead, set your schedule for however frequently you need updated data, for example, every other day at 8 AM. Making this adjustment would halve the amount of Foundry usage.
The two biggest themes to keep in mind when thinking about best practices for optimizing schedules are 1) eliminating duplicate schedules and 2) eliminating unnecessary schedules.
Eliminating duplicate schedules¶
To identify redundant schedules, start by going into the Data Lineage application and changing the node color to Schedule Count. If select nodes have more than one schedule associated with them, select the node and view the Manage schedule tool. There, you will be able to view the associated schedules, determine who owns them, and whether they can be consolidated.
The best practice is to ensure each dataset in your pipeline only has one scheduled build associated with it. Having a dataset built by two different schedules can lead to queuing and a slow-down on both schedules and wasteful batch compute.
Another best practice to reduce redundant schedules is to avoid full builds and use connecting builds instead. One example is an Ontology pipeline that includes a raw dataset, a cleaned dataset, and data transforms, and ultimately ends with an Ontology. Instead of setting up three schedules, one to run on the raw dataset, the second to run on the cleaned dataset, and the third to run on the transformed dataset, you only need one schedule where the raw dataset is the trigger and the Ontology dataset is the target.
Eliminating unnecessary schedules¶
To identify unnecessary schedules, go into the Data Lineage application and color the nodes by Time since last built. This allows you to view what data is being updated most frequently and determine whether this is the most optimal frequency for your Organization.
The frequency and timing of schedules are critical factors in optimizing usage. How frequently does your Organization need data updated?
- If you have a schedule set to daily, consider whether you need updated data on the weekends. If you do not, changing the schedule to only update Monday to Friday could save nearly 30% of Foundry batch compute usage for that pipeline or dataset.
- If you have a schedule set to run three times a day, is your Organization using the refreshed data three times a day, or is once overnight sufficient?
- Is a time-based trigger needed or will a condition-based trigger work? Do you need to schedule the pipeline to refresh every day at 2 AM or can it refresh only when a certain input dataset refreshes? Event-based triggers tend to be more efficient and save more usage than time-based triggers.
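The savings quoted for schedule changes follow from simple run counts. The sketch below reproduces that arithmetic under one stated assumption: every run consumes the same amount of compute, which may not hold for incremental pipelines where a less frequent run processes more data. The helper name `usage_saving` is hypothetical.

```python
def usage_saving(runs_before: int, runs_after: int) -> float:
    """Fraction of batch compute saved, assuming every run costs the same."""
    return 1 - runs_after / runs_before


# Daily (7 runs per week) reduced to weekdays only (5 runs per week).
weekday_only = usage_saving(runs_before=7, runs_after=5)
# Daily reduced to every other day, counted over a 14-day window.
every_other_day = usage_saving(runs_before=14, runs_after=7)
# Three runs per day reduced to one overnight run.
once_per_day = usage_saving(runs_before=3, runs_after=1)

print(f"{weekday_only:.1%}")     # 28.6%
print(f"{every_other_day:.1%}")  # 50.0%
print(f"{once_per_day:.1%}")     # 66.7%
```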
Additionally, avoid scheduling all builds at the same time, so that debugging is more efficient and uses less compute. When thinking about frequency and timing, orient back to your Organization's requirements and ensure the refresh rate you set meets what your Organization needs without exceeding it.
Lastly, review the Advanced options when setting up a schedule. Consider enabling Abort build on failure to reduce wasted batch compute. You can also lower the number of allowed attempts for failed jobs to three or fewer, compared to the maximum of five. It is also recommended to set the minutes between rebuilds to at least one to three minutes, giving any glitches that caused the failure time to resolve themselves.
Step 7: Optimize Spark¶
Spark is an open-source, distributed cluster-computing framework for quick, large-scale data processing and analytics. Spark makes it easier and quicker for computers to process a lot of data or analytics by splitting up the work between different systems and tackling them in parallel instead of waiting for everything to be completed linearly.
As an initial disclaimer, and as a general best practice when it comes to optimization, we recommend that you manually tweak Spark configurations only if a particular bottleneck is identified, and not as a general practice on the entire platform. This is because new optimization features are progressively added and enabled on Foundry for transforms which do not have manual overrides. A simple example is the introduction of dynamic allocation on Spark 3. While it was previously very important to manually set the number of executors for a transform, nowadays this number is automatically adjusted at execution time to avoid waste.
:::callout{theme="neutral"} Optimizing Spark should only be done by users who are familiar with Spark concepts and have a strong understanding of how it is used within a pipeline. :::
A first step to optimizing Spark is reviewing and understanding Spark profiles. A “Spark profile” is the configuration that Foundry uses to configure distributed compute resources (such as drivers and executors) with the appropriate amount of CPU cores and memory. Most of the time, we recommend using automatic profiles rather than attempting to tweak manually; however, sometimes you may be able to identify the use of a large driver that is not necessary.
You can use the Spark usage coloring on Data Lineage to identify datasets that may be using higher profiles.
Below are some core Spark concepts and terminology to understand before starting to optimize.
- Driver cores: Controls how many CPU cores are assigned to a Spark driver.
- Driver memory (JVM, not off-heap): Controls how much memory is assigned to the Spark driver.
- Executor cores: Controls how many CPU cores are assigned to each Spark executor which in turn controls how many tasks are run concurrently in each executor.
- Executor memory: Controls how much memory is assigned to each Spark executor and shared among all tasks it is running.
- Number of executors: Controls how many executors are requested to run the job.
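The settings above combine to determine how much work a job can run at once. The sketch below shows that relationship, assuming Spark's default of one CPU core per task (`spark.task.cpus = 1`). The `SparkProfile` class and the example values are hypothetical and do not correspond to any named Foundry profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkProfile:
    """Hypothetical container for the five settings described above."""

    driver_cores: int
    driver_memory_gb: int
    executor_cores: int
    executor_memory_gb: int
    num_executors: int

    @property
    def task_slots(self) -> int:
        # With one core per task, each executor runs `executor_cores` tasks at once.
        return self.executor_cores * self.num_executors

    @property
    def total_cores(self) -> int:
        # Cores held by the job: the driver plus all executors.
        return self.driver_cores + self.task_slots

    @property
    def memory_per_task_gb(self) -> float:
        # Executor memory is shared by its concurrent tasks; this is a rough average.
        return self.executor_memory_gb / self.executor_cores


profile = SparkProfile(
    driver_cores=1,
    driver_memory_gb=4,
    executor_cores=2,
    executor_memory_gb=8,
    num_executors=4,
)

print(profile.task_slots)          # 8
print(profile.total_cores)         # 9
print(profile.memory_per_task_gb)  # 4.0
```

Comparing `task_slots` with the parallelism a job actually achieves is one way to spot a profile that is larger than the job needs.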
Beyond Spark profiles, the following best practices provide a starting point when looking to optimize Spark for the purpose of reducing Foundry usage.
- Help Spark's query optimizer by reducing the data early in your code.
- Order matters - Re-order operations so that any filtering of the data is done in earlier stages.
- Drop what you do not need - if you wait to drop the data you do not need until the end, this will take up Spark processing power and time.
- Be aware of, and minimize, the data exchange between executors, also called “shuffling”.
- Be aware of, and minimize, the use of Spark actions (such as `count`, `collect`, `take`). Unlike transformations (such as `filter`, `select`), which are lazily executed, actions put constraints on the computation graph and prevent potential optimizations.
- Be aware of the differences between repartition (shuffle and re-organize all the data) and coalesce (merge existing partitions).
- Watch out for skewness. Many techniques exist to resolve skewness, including broadcast joins (if the left dataset is small enough) and “join salting”.
- Maximize the parallelism of your job. The Spark details interface provides a view of the execution parallelism at each step. Note that if you are using Spark dynamic executor allocation (enabled by default), unnecessary executors will automatically be scaled down when possible, avoiding any waste.
- Be aware of the overhead of starting a Spark task (viewable in the Spark details interface in Foundry). A large number of tiny tasks leads to overhead and unnecessary data exchange. As a rule of thumb, 30 s to 60 s per task is a decent number (to be put in perspective with the total time of the stage).
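Of the techniques in the list above, join salting is the least self-explanatory. The sketch below illustrates the idea in plain Python rather than Spark, so it runs without a cluster. The data, the salt count, and the round-robin salt assignment are assumptions made for illustration; a Spark implementation would typically assign a random salt on the large side and replicate the small side with an explode step.

```python
from collections import Counter

NUM_SALTS = 4  # hypothetical number of sub-keys per original key

# Skewed "large" side of the join: one hot key dominates.
large = [("hot", i) for i in range(96)] + [("cold", i) for i in range(4)]
# "Small" side of the join: one row per key.
small = {"hot": "H", "cold": "C"}

# All rows sharing a join key are handled together, so the largest
# key group bounds the size of the slowest task.
plain_rows_per_key = Counter(key for key, _ in large)

# Salting: spread each key over NUM_SALTS sub-keys on the large side...
salted_large = [
    (f"{key}#{i % NUM_SALTS}", value) for i, (key, value) in enumerate(large)
]
# ...and replicate every small-side row once per salt so all matches survive.
salted_small = {
    f"{key}#{salt}": value for key, value in small.items() for salt in range(NUM_SALTS)
}
salted_rows_per_key = Counter(key for key, _ in salted_large)

# Both joins produce the same rows once the salt is stripped again.
plain_join = sorted((key, value, small[key]) for key, value in large)
salted_join = sorted(
    (key.split("#")[0], value, salted_small[key]) for key, value in salted_large
)

print(max(plain_rows_per_key.values()))   # 96
print(max(salted_rows_per_key.values()))  # 24
print(plain_join == salted_join)          # True
```

The trade-off is that the small side grows by a factor of `NUM_SALTS`, so salting pays off only when the skew on the large side is the dominant cost.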
If you are trying to decide between multiple single output transforms versus multi-output transforms in your workflow, review our guide on Optimize multi-output transforms.