Time series query compute usage¶
Foundry caches time series data on disk in a format that is highly optimized for time series queries. Querying this data consumes compute-seconds.
Time series queries use compute in the following ways:
- Each query consumes a fixed minimum number of compute-seconds.
- Each query consumes additional compute-seconds based on the number of time series points that it reads.
Queries on larger time series indexes read more points. The following sections describe how this usage is calculated.
How compute is measured¶
:::callout{theme="neutral"}
Note that when you are paying for Foundry usage, a fixed, minimum number of compute-seconds are consumed per query. The default amount is 4 compute-seconds. This is the base compute usage required in order to serve a query. If you have an enterprise contract with Palantir, contact your Palantir representative before proceeding with compute usage calculations.
:::
Storing time series data is measured under Foundry storage. Storing time series data does not use any compute; only indexing time series and actively querying the data will use compute.
Time series query compute is used exclusively when querying time series data stored in Foundry. Time series queries use compute in two ways:
- A fixed minimum number of compute-seconds per query. On a usage-based Foundry instance, each query uses 4 compute-seconds.
- An additional amount that scales with the number of points the query scans. The number of points scanned is driven by the size of the series being queried and by the complexity of the logic being executed, which can require many comparisons or operations on points.
The following formula gives the compute-seconds used by a single query:
compute-seconds = 4 + points_scanned / 25000
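As a rough illustration, the formula can be wrapped in a small helper. This is a sketch rather than part of any Foundry API: the name `estimate_compute_seconds` and the constant names are made up for this example, and the defaults assume the usage-based rates quoted above (a 4 compute-second minimum and 25,000 points per compute-second).

```python
# Hypothetical helper; not part of Foundry or FoundryTS.
MINIMUM_QUERY_USAGE = 4             # compute-seconds per query (default usage-based rate)
POINTS_PER_COMPUTE_SECOND = 25_000  # points scanned per additional compute-second


def estimate_compute_seconds(points_scanned: int) -> float:
    """Estimate compute-seconds for one query: 4 + points_scanned / 25000."""
    if points_scanned < 0:
        raise ValueError("points_scanned must be non-negative")
    return MINIMUM_QUERY_USAGE + points_scanned / POINTS_PER_COMPUTE_SECOND


for points in (0, 100_000, 10_000_000):
    cost = estimate_compute_seconds(points)
    print(f"{points:>10,} points -> {cost:.1f} compute-seconds")
```

Under these assumptions, a query that scans 100,000 points uses about 8 compute-seconds, while one that scans 10,000,000 points uses about 404, so the fixed minimum only dominates for small scans.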
Investigate time series query usage¶
The Resource Management application allows you to drill down to dataset usage information and should be a starting point for investigating usage in the Foundry platform.
Users have access to multiple tools for querying time series data in Foundry. Time series query usage is always attached to the resource that each tool produces or modifies.
- In Quiver, the time series queries will be attributed to the individual Quiver analysis or dashboard that was used to initiate the queries.
- In Workshop, the queries will be attached to the Workshop app that initiated the queries. When a Workshop app embeds a Quiver component that queries time series, the usage is still attributed to the overarching Workshop application.
- For transformations that use the FoundryTS library to query time series, the compute for those queries is attributed to the repository where that transformation was written.
Understand drivers of usage¶
When paying for Foundry on a usage contract, there are three main drivers of compute with time series queries:
- Number of queries
- Each query has a minimum usage of 4 compute-seconds. This means that more queries executed by users, more Quiver dashboards running queries, and more scheduled builds running FoundryTS queries will use more compute-seconds.
- Size of the series being queried
- Reducing the number of points read typically reduces both compute and latency. For example, running an aggregation across a series of 10,000,000 points will use more compute than running the same query against a 100,000 point series. However, the actual usage depends on the query. The most common technique for reducing the number of points read is adding a time range filter on each series.
- Complexity of the query
- Queries that perform more complex interactions with underlying series will run more operations on top of that series. In running these operations, query usage will increase as more points are scanned.
See the section below for more information and examples of query complexity.
Manage usage with time series¶
Managing the total number of queries is important for managing total compute usage from time series querying. Consider the following practices when using time series in Foundry:
- When building a time series-based analysis or dashboard, consider the total number of queries that must run on each update. Also consider splitting out update paths so not all queries update when parameters are changed.
- When running FoundryTS builds, ensure that the schedules are set appropriately so that builds do not run more often than necessary.
- When a time range is applied to a time series, Foundry's time series backend will only read the points that fall within that time range. Therefore, time series queries in Foundry can be optimized based on the ranges that they are querying. Limiting the time range of a query does not require compute; choosing time ranges can significantly reduce overall points scanned, reducing the compute used in the query.
- Choose the correct granularity for the job; not all workflows require maximally granular data. In some cases, this means pre-aggregating for specific workflows. In other use cases, this means storing different granularities in different series. While maintaining multiple up-to-date series can increase indexing compute, the act of storing multiple series does not require compute. Therefore, holding multiple series of different granularities may be cheaper than always querying the most granular series.
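To make the effect of time ranges and granularity concrete, the sketch below compares three hypothetical queries using the formula from earlier on this page. Everything specific here is an assumption for illustration: a series sampled once per minute with five years of history, a 30-day time range, an hourly pre-aggregated copy of the same series, and a simple query that scans each point in its range once. The helper names are not part of any Foundry API.

```python
# Illustrative only: sampling rates, history length, and window are assumptions.
MINIMUM_QUERY_USAGE = 4
POINTS_PER_COMPUTE_SECOND = 25_000


def query_cost(points_scanned: int) -> float:
    """Compute-seconds for one query: 4 + points_scanned / 25000."""
    return MINIMUM_QUERY_USAGE + points_scanned / POINTS_PER_COMPUTE_SECOND


def points_in_window(days: int, points_per_day: int) -> int:
    """Number of points in a window, assuming evenly spaced samples."""
    return days * points_per_day


full_history = points_in_window(days=5 * 365, points_per_day=24 * 60)  # 1-minute data, no time range
last_30_days = points_in_window(days=30, points_per_day=24 * 60)       # same series, 30-day time range
hourly_30_days = points_in_window(days=30, points_per_day=24)          # hourly series, 30-day time range

for label, points in [
    ("full history, 1-minute", full_history),
    ("30 days, 1-minute", last_30_days),
    ("30 days, hourly", hourly_30_days),
]:
    print(f"{label:<24} {points:>9,} points  {query_cost(points):7.2f} compute-seconds")
```

With these numbers the unrestricted query costs roughly 109 compute-seconds, the 30-day query roughly 5.7, and the hourly query roughly 4.03. Once the scan is small, the 4 compute-second minimum dominates, which is why reducing the number of queries matters as much as reducing points scanned.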
Calculate usage¶
To predict the cost of a time series query, you need to know the size of the series being queried.
Consider the following example, where there are three queries against a series of 100,000 points with each query scanning all points in the series:
series_size: 100,000 points
minimum_query_usage: 4 compute-seconds
points_per_compute_second: 25,000 points
total_queries: 3
compute-seconds = total_queries * minimum_query_usage + (series_size * total_queries) / points_per_compute_second
compute-seconds = 3 queries * 4 compute-seconds + (100,000 points * 3 queries) / 25,000 points per compute-second
compute-seconds = 3 * 4 + 300,000 / 25,000
compute-seconds = 12 + 12
compute-seconds = 24 compute-seconds
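The same arithmetic can be checked in a few lines of Python. This is only a restatement of the example above; the variable names mirror the quantities listed there, and it assumes every query scans the whole series exactly once.

```python
# Worked example from this section, using the default usage-based rates.
series_size = 100_000               # points in the series
minimum_query_usage = 4             # compute-seconds per query
points_per_compute_second = 25_000
total_queries = 3

# Fixed portion: every query pays the minimum.
fixed_usage = total_queries * minimum_query_usage

# Scan portion: each query reads the full series once.
scan_usage = series_size * total_queries / points_per_compute_second

total_compute_seconds = fixed_usage + scan_usage
print(fixed_usage, scan_usage, total_compute_seconds)  # 12 12.0 24.0
```

Splitting the total into a fixed portion and a scan portion shows that, in this example, half of the usage comes from the per-query minimum.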
Examples of query complexity¶
The complexity of a time series query increases as more nested operations are applied to the queried time series.
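As an example, consider the following FoundryTS code, which adds two time series together and returns all the points of the new series for a 1-year time range as a pandas DataFrame:
# N and F are assumed to be the FoundryTS node and function modules, imported elsewhere.
series_1 = N.TimeseriesNode('series_1')
series_2 = N.TimeseriesNode('series_2')
# Add the two series, then restrict the result to a 1-year time range.
result = F.dsl(program='a+b', return_type=float)([series_1, series_2]).time_range(start='2022-01-01', end='2023-01-01')
# Evaluate the query and load the resulting points into a pandas DataFrame.
result.to_pandas()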
This code will make a query to Codex with the shape:
{
  id: dsl-formula
  children: [
    { id: timeseries },
    { id: timeseries }
  ]
}
Evaluating the to_pandas call will incur cost for scanning the points in the result time series in the requested 1-year time range, as well as the points in the two component series required to compute the result (in this case, a 1-year range from each).
Now, consider the following FoundryTS code which applies more nested operations. First, we define a series that is the sum of two other series. Then, we compare that series against its 7-day rolling average, and load one year of points from the result as a Pandas dataframe:
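# N and F are assumed to be the FoundryTS node and function modules, imported elsewhere.
series_1 = N.TimeseriesNode('series_1')
series_2 = N.TimeseriesNode('series_2')
# Sum of the two input series.
intermediate_1 = F.dsl(program='a+b', return_type=float)([series_1, series_2])
# 7-day rolling mean of that sum.
intermediate_2 = intermediate_1.rolling_aggregate('mean', '7d')
# Difference between the sum and its rolling mean, restricted to a 1-year time range.
result = F.dsl(program='a-b', return_type=float)([intermediate_1, intermediate_2]).time_range(start='2022-01-01', end='2023-01-01')
result.to_pandas()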
This code will make a query to Codex with the shape:
{
  id: dsl-formula
  children: [
    {
      id: dsl-formula
      children: [
        { id: timeseries },
        { id: timeseries }
      ]
    },
    {
      id: rolling-aggregate
      children: [
        {
          id: dsl-formula
          children: [
            { id: timeseries },
            { id: timeseries }
          ]
        }
      ]
    }
  ]
}
Each node in this query tree will incur cost for scanning the 1-year range of points to produce the final result.
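One way to reason about the difference between the two query shapes is to count their nodes. The sketch below models each shape as a nested dictionary and applies the formula from earlier on this page. It rests on several assumptions that are not stated in the source: the dictionary layout and helper names are hypothetical, every node is assumed to scan the same number of points, and the series are assumed to be sampled once per minute. Real queries may differ, for example if a rolling aggregate reads extra points before the start of the range.

```python
# Sketch only: the dict layout and helper names are hypothetical, and the
# equal-points-per-node assumption is a simplification.
MINIMUM_QUERY_USAGE = 4
POINTS_PER_COMPUTE_SECOND = 25_000


def node(node_id, *children):
    """Build one query-tree node with the given children."""
    return {"id": node_id, "children": list(children)}


def count_nodes(tree):
    """Count a node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in tree["children"])


def estimate_compute_seconds(tree, points_per_node):
    """Apply 4 + points_scanned / 25000, with every node scanning points_per_node."""
    points_scanned = count_nodes(tree) * points_per_node
    return MINIMUM_QUERY_USAGE + points_scanned / POINTS_PER_COMPUTE_SECOND


def sum_of_two_series():
    """The 'a+b' formula node over two raw series."""
    return node("dsl-formula", node("timeseries"), node("timeseries"))


simple_query = sum_of_two_series()
nested_query = node(
    "dsl-formula",
    sum_of_two_series(),
    node("rolling-aggregate", sum_of_two_series()),
)

points_in_year = 365 * 24 * 60  # assumed 1-minute sampling over the 1-year range

for name, tree in [("simple", simple_query), ("nested", nested_query)]:
    cost = estimate_compute_seconds(tree, points_in_year)
    print(f"{name}: {count_nodes(tree)} nodes, about {cost:.1f} compute-seconds")
```

Under these assumptions the first query has 3 nodes and the second has 8, so the estimate rises from roughly 67 to roughly 172 compute-seconds even though both queries return one year of points.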