Compute usage: Ontology indexing
Foundry’s Ontology stores objects in an Ontology index, a storage format optimized for rapid access. Data in Foundry datasets can be of any size or format, so a data transformation is required to prepare dataset data for storage in an Ontology index. This process is known as Ontology indexing and can be applied to datasets and objects of arbitrary size. The processing cost of Ontology indexing is measured in compute-seconds. This documentation describes how Ontology indexing uses compute and how to manage that compute usage.
Measuring compute usage from Ontology indexing
Ontology indexing uses a parallelized Spark backend to read arbitrarily large sets of data and transform them into the Ontology format. The compute used by an indexing job depends on the computational resources allocated to it (driver and executors) and the total wall-clock duration of the job.
For more information on how Spark usage translates to compute-seconds, see the main Usage Types documentation. Below, you can find examples of the calculations for compute-seconds used by Ontology indexing.
Investigating usage from Ontology indexing
Ontology indexing jobs are exposed in Foundry’s Builds application and are attached to the object that is being indexed. Ontology indexing jobs are Spark jobs, so they are classified as parallelized batch compute and can be measured in the same way as other jobs on the same backend, such as Code Repositories transforms and Contour queries.
Indexing jobs can be categorized based on how they are triggered.
- Ontology indexing jobs index datasets into the Ontology backend. This compute is used to produce indexed objects from datasets.
- Ontology export jobs persist edits made directly in the Ontology to datasets in the Foundry transformation framework. These jobs tend to be smaller than full indexing jobs because they generally handle only edits, which are a strict subset of the total object set.
Drivers of usage for Ontology indexing
Ontology indexing jobs must read all of the data that needs to be indexed and transform it into a format that the Ontology backend can store, search, and edit quickly.
Compute usage when reading and indexing data is driven by the following factors:
- Number of records per object
  - The number of objects increases as the number of records in the dataset being indexed increases. Each object requires a certain number of computational operations for indexing, so increasing the number of objects increases the amount of compute used for indexing.
- Number of properties per object
  - Each property of each object must be individually analyzed by the indexing job and then written into the object index. More compute is used if there are more properties to analyze and index.
- Size of each property
  - Some properties are much larger than others. For instance, a text property containing a lot of content will require more space and compute to analyze than a simple number property. Objects with larger, more complex property types will require more compute to index.
Indexing frequency also plays a large role in how much compute is used for Ontology updates. Schedules set on upstream datasets will trigger auto-reindexes of objects. When examining the usage implications of keeping an object up-to-date, consider the update schedules on that object and its upstream datasets.
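To see how update schedules multiply usage, the following minimal Python sketch projects daily compute from the cost of a single run and the number of runs per day. It assumes every run is a full reindex that costs the same amount; the function name, the 30 compute-seconds per run, and the schedules are hypothetical values chosen for illustration, not measurements from Foundry.

```python
def projected_daily_compute_seconds(compute_seconds_per_run: float, runs_per_day: int) -> float:
    """Project daily usage, assuming every scheduled run costs the same amount."""
    return compute_seconds_per_run * runs_per_day


# Hypothetical indexing job that uses 30 compute-seconds each time it runs.
per_run = 30

# Hypothetical upstream schedules and the number of reindexes each triggers per day.
schedules = {"daily": 1, "hourly": 24, "every 15 minutes": 96}

for label, runs in schedules.items():
    total = projected_daily_compute_seconds(per_run, runs)
    print(f"{label}: {total} compute-seconds per day")
```

Under these assumptions, moving an upstream schedule from daily to hourly multiplies indexing usage by 24, even though the object definition itself has not changed.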
Managing Ontology indexing compute
Ontology indexing jobs can be optimized to reduce compute usage. The first and simplest method of optimization is to reduce the size of the input data for the index, which decreases the amount of work needed to complete the job. This involves doing the following where possible:
- Managing the number of input records
- Managing the number of properties per object
- Managing the size of each property per object
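As a rough illustration of these three levers, the sketch below trims a small in-memory set of records before it would be handed to indexing, then compares the record count, property count, and total property size before and after. In practice this kind of trimming would typically happen in an upstream transform; the helper names, the sample records, and the 200-character limit are all hypothetical, and the byte count is only a crude proxy for input size, not a Foundry metric.

```python
def input_size_summary(records):
    """Return (record count, max properties per record, total bytes of property values)."""
    num_records = len(records)
    max_props = max((len(r) for r in records), default=0)
    total_bytes = sum(len(str(v).encode("utf-8")) for r in records for v in r.values())
    return num_records, max_props, total_bytes


def trim_record(record, keep, max_text_len):
    """Keep only the named properties and truncate long string values."""
    trimmed = {}
    for name in keep:
        if name in record:
            value = record[name]
            if isinstance(value, str):
                value = value[:max_text_len]
            trimmed[name] = value
    return trimmed


# Hypothetical input rows; "notes" stands in for a large text property.
records = [
    {"id": 1, "title": "Pump A", "notes": "x" * 5000, "internal_flag": True, "active": True},
    {"id": 2, "title": "Pump B", "notes": "y" * 8000, "internal_flag": False, "active": False},
]

# Fewer records (active only), fewer properties, smaller text values.
trimmed = [
    trim_record(r, keep=("id", "title", "notes"), max_text_len=200)
    for r in records
    if r["active"]
]

print("before:", input_size_summary(records))
print("after: ", input_size_summary(trimmed))
```

Whether a given filter or truncation is acceptable depends on what users need from the object, so each reduction is a trade-off rather than a free saving.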
Another optimization method is configuring Ontology indexing jobs to use changelog strategies for indexing. Changelog indexing significantly reduces the number of objects that need to be created or updated per indexing job by comparing the job against existing objects prior to execution. Changelog indexing requires more configuration and adherence to an update strategy, but can produce orders-of-magnitude performance and efficiency gains.
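The idea of comparing incoming data against existing objects can be sketched conceptually as follows. This is not how Foundry implements changelog indexing; it is a minimal Python illustration of why a comparison step shrinks the work, using hypothetical dictionaries keyed by a primary key and a hypothetical function name.

```python
def diff_against_existing(existing, incoming):
    """Return (upserts, deletes): only the objects that actually need work."""
    # Objects that are new, or whose properties differ from the stored version.
    upserts = {
        key: props
        for key, props in incoming.items()
        if key not in existing or existing[key] != props
    }
    # Objects that are stored but no longer present in the incoming data.
    deletes = [key for key in existing if key not in incoming]
    return upserts, deletes


# Hypothetical stored objects and a hypothetical new snapshot of the dataset.
existing = {
    "a1": {"status": "open"},
    "a2": {"status": "closed"},
    "a3": {"status": "open"},
}
incoming = {
    "a1": {"status": "open"},   # unchanged, so no work is needed
    "a2": {"status": "open"},   # changed
    "a4": {"status": "new"},    # created
}

upserts, deletes = diff_against_existing(existing, incoming)
print("to create or update:", sorted(upserts))
print("to delete:", deletes)
```

In this toy example only two of the three incoming objects need to be written. On a large object type where few rows change between runs, the same principle is what makes the savings substantial.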
Example indexing compute calculation
Indexing jobs take the form of parallelized Spark jobs and can be seen in the Builds application. See the following example for an indexing job. Note that Ontology indexing jobs will automatically choose the size of the driver and executors for the indexing job, depending on the size of the job.
Driver:
    num_vcpu: 1
    GiB_RAM: 6
Executors:
    num_vcpu: 1
    GiB_RAM: 4
    num_executors: 2
Total Runtime: 10 seconds

Calculation:
driver_compute_seconds = max(num_vcpu, GiB_RAM / 7.5) * runtime_in_seconds
                       = max(1vcpu, 6GiB / 7.5) * 10sec
                       = 1 * 10 = 10 compute-seconds
executor_compute_seconds = max(num_vcpu, GiB_RAM / 7.5) * num_executors * runtime_in_seconds
                         = max(1vcpu, 4GiB / 7.5) * 2executors * 10sec
                         = 1 * 2 * 10 = 20 compute-seconds
total_compute_seconds = driver_compute_seconds + executor_compute_seconds
                      = 10 compute-seconds + 20 compute-seconds
                      = 30 compute-seconds
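The calculation above can also be written as a small Python function, which makes it easier to try other driver and executor sizes. This is a sketch of the arithmetic shown in the example only: the function and parameter names are hypothetical, and the 7.5 GiB-per-vCPU ratio is taken directly from the formula above.

```python
def compute_seconds(num_vcpu, gib_ram, runtime_seconds, count=1):
    """Compute-seconds for `count` identical Spark containers (driver or executors).

    Each container is billed on whichever is larger: its vCPU count or its
    memory divided by 7.5 GiB, as in the worked example.
    """
    return max(num_vcpu, gib_ram / 7.5) * count * runtime_seconds


# Values from the worked example: one driver and two executors, running for 10 seconds.
driver = compute_seconds(num_vcpu=1, gib_ram=6, runtime_seconds=10)
executors = compute_seconds(num_vcpu=1, gib_ram=4, runtime_seconds=10, count=2)
total = driver + executors

print(f"driver: {driver}, executors: {executors}, total: {total} compute-seconds")
```

Note that memory only affects the result once a container has more than 7.5 GiB of RAM per vCPU. In the example both the driver (6 GiB) and the executors (4 GiB) fall below that threshold, so the vCPU count determines the usage.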