foundryts.functions.statistics
foundryts.functions.statistics(start=None, end=None, window=None, **kwargs)
Returns a function that will partition a single series into windows and compute statistics over each window.
The series is partitioned into windows using the window arg, and each window yields the statistics listed in the Dataframe schema. The statistics are calculated over rolling windows defined in one of two ways: by periodicity (the width of each window), or by a fixed number of windows, where the width is calculated using round(total number of points / number of buckets). Which option applies is decided by whether the window or window_count argument is passed.
- Parameters:
  - start (Union[int, datetime, str], optional) – Timestamp (inclusive) from which to start partitioning the provided series into windows. (default is the entire series)
  - end (Union[int, datetime, str], optional) – Timestamp (inclusive) at which to stop partitioning the provided series into windows. (default is the entire series)
  - window (Union[int, datetime, str], optional) – Width of each window, given as a timedelta; the series is divided into windows of this width. (default is the entire series)
  - **kwargs – Flags for determining the window behavior and the output type.
- Keyword Arguments:
  - include_std_dev (bool, False) – If set to True, the output will include the standard deviation.
  - window_count (int, optional) – Number of windows to compute the statistics over (instead of the size of each window).
- Returns: A function that accepts a single series as input and partitions it into windows, computing the statistics for each window.
- Return type: (FunctionNode) -> SummarizerNode
Dataframe schema
| Column name | Type | Description |
|---|---|---|
| count | int | Number of data points in the window of the input series. |
| earliest_point.timestamp | datetime | Timestamp of the first data point in the window of the input series. |
| earliest_point.value | float | Value of the first data point in the window of the input series. |
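| end_timestamp | datetime | End timestamp of the window. In the examples on this page this is the window boundary, not necessarily the timestamp of a data point. |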
| largest_point.timestamp | datetime | Timestamp of the data point with the largest value in the window of the input series. |
| largest_point.value | float | Largest value in the window of the input series. |
| latest_point.timestamp | datetime | Timestamp of the most recent data point in the window of the input series. |
| latest_point.value | float | Value of the most recent data point in the window of the input series. |
| mean | float | Average value of all data points in the window of the input series. |
| smallest_point.timestamp | datetime | Timestamp of the data point with the smallest value in the window of the input series. |
| smallest_point.value | float | Smallest value in the window of the input series. |
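| start_timestamp | datetime | Start timestamp of the window. In the examples on this page this is the window boundary, not necessarily the timestamp of a data point. |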
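
To make the point-based columns concrete, here is a minimal sketch that computes them by hand for one window, using only the Python standard library. The helper name `summarize_window` and the use of plain integer nanosecond timestamps are assumptions made for illustration; this is not how foundryts computes the values internally. The input points are the three points that fall in the 200 ns to 300 ns window of the example series on this page.

```python
from statistics import fmean

# Points of one window as (timestamp_ns, value) pairs, copied from the
# example series on this page (the window starting at 200 ns).
window_points = [(200, 2.0), (201, 1.0), (299, 35.0)]


def summarize_window(points):
    """Hypothetical helper: build the point-based columns for one window."""
    earliest = min(points, key=lambda p: p[0])  # lowest timestamp
    latest = max(points, key=lambda p: p[0])    # highest timestamp
    largest = max(points, key=lambda p: p[1])   # highest value
    smallest = min(points, key=lambda p: p[1])  # lowest value
    return {
        "count": len(points),
        "earliest_point.timestamp": earliest[0],
        "earliest_point.value": earliest[1],
        "largest_point.timestamp": largest[0],
        "largest_point.value": largest[1],
        "latest_point.timestamp": latest[0],
        "latest_point.value": latest[1],
        "mean": fmean([value for _, value in points]),
        "smallest_point.timestamp": smallest[0],
        "smallest_point.value": smallest[1],
    }


row = summarize_window(window_points)
for column, value in row.items():
    print(column, value)
```

The resulting values agree with the row for this window in the time-based example output (count 3, largest value 35.0 at 299 ns, smallest value 1.0 at 201 ns, mean of about 12.666667).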
:::callout{theme="success" title="See Also"}
distribution(), scatter()
:::
Notes
This function is only applicable to numeric series.
In the future, the include_std_dev kwarg will be deprecated as this feature will be made the default.
window_count can only be used together with include_std_dev, in which case it overrides window. If window_count is passed without include_std_dev, it is ignored.
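
The precedence between window, window_count and include_std_dev can be restated as a small decision function. This is only a model of the notes in this section: `resolve_window_mode` and its return values are hypothetical names and are not part of foundryts.

```python
def resolve_window_mode(window=None, window_count=None, include_std_dev=False):
    """Hypothetical model of the argument precedence described in the notes.

    Returns a (mode, value) tuple; nothing here calls foundryts.
    """
    # window_count is only honoured together with include_std_dev,
    # and in that case it takes priority over window.
    if window_count is not None and include_std_dev:
        return ("fixed_count", window_count)
    # Otherwise a window width, if given, defines the partitioning.
    if window is not None:
        return ("fixed_width", window)
    # With neither option in effect, the default is the entire series.
    return ("entire_series", None)


print(resolve_window_mode(window="100ns"))
print(resolve_window_mode(window="100ns", window_count=3, include_std_dev=True))
print(resolve_window_mode(window="100ns", window_count=3))
```

In the third call, window_count is dropped because include_std_dev is not set, so the width-based window applies.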
Examples
>>> import foundryts.functions as F
>>> series = F.points(
...     (1, 8.0),
...     (101, 4.0),
...     (200, 2.0),
...     (201, 1.0),
...     (299, 35.0),
...     (300, 16.0),
...     (350, 32.0),
...     (1000, 64.0),
... )
>>> series.to_pandas()
timestamp value
0 1970-01-01 00:00:00.000000001 8.0
1 1970-01-01 00:00:00.000000101 4.0
2 1970-01-01 00:00:00.000000200 2.0
3 1970-01-01 00:00:00.000000201 1.0
4 1970-01-01 00:00:00.000000299 35.0
5 1970-01-01 00:00:00.000000300 16.0
6 1970-01-01 00:00:00.000000350 32.0
7 1970-01-01 00:00:00.000001000 64.0
>>> stats = F.statistics(window="100ns")(series) # use time-based window
>>> stats.to_pandas()
count earliest_point.timestamp earliest_point.value end_timestamp largest_point.timestamp largest_point.value latest_point.timestamp latest_point.value mean smallest_point.timestamp smallest_point.value start_timestamp
0 1 1970-01-01 00:00:00.000000001 8.0 1970-01-01 00:00:00.000000100 1970-01-01 00:00:00.000000001 8.0 1970-01-01 00:00:00.000000001 8.0 8.000000 1970-01-01 00:00:00.000000001 8.0 1970-01-01 00:00:00.000000000
1 1 1970-01-01 00:00:00.000000101 4.0 1970-01-01 00:00:00.000000200 1970-01-01 00:00:00.000000101 4.0 1970-01-01 00:00:00.000000101 4.0 4.000000 1970-01-01 00:00:00.000000101 4.0 1970-01-01 00:00:00.000000100
2 3 1970-01-01 00:00:00.000000200 2.0 1970-01-01 00:00:00.000000300 1970-01-01 00:00:00.000000299 35.0 1970-01-01 00:00:00.000000299 35.0 12.666667 1970-01-01 00:00:00.000000201 1.0 1970-01-01 00:00:00.000000200
3 2 1970-01-01 00:00:00.000000300 16.0 1970-01-01 00:00:00.000000400 1970-01-01 00:00:00.000000350 32.0 1970-01-01 00:00:00.000000350 32.0 24.000000 1970-01-01 00:00:00.000000300 16.0 1970-01-01 00:00:00.000000300
4 1 1970-01-01 00:00:00.000001000 64.0 1970-01-01 00:00:00.000001100 1970-01-01 00:00:00.000001000 64.0 1970-01-01 00:00:00.000001000 64.0 64.000000 1970-01-01 00:00:00.000001000 64.0 1970-01-01 00:00:00.000001000
>>> stats_with_std_dev = F.statistics(window="100ns", include_std_dev=True)(series)
>>> stats_with_std_dev.to_pandas()
count earliest_point.timestamp earliest_point.value end_timestamp largest_point.timestamp largest_point.value latest_point.timestamp latest_point.value mean smallest_point.timestamp smallest_point.value standard_deviation start_timestamp
0 1 1970-01-01 00:00:00.000000001 8.0 1970-01-01 00:00:00.000000100 1970-01-01 00:00:00.000000001 8.0 1970-01-01 00:00:00.000000001 8.0 8.000000 1970-01-01 00:00:00.000000001 8.0 0.000000 1970-01-01 00:00:00.000000000
1 1 1970-01-01 00:00:00.000000101 4.0 1970-01-01 00:00:00.000000200 1970-01-01 00:00:00.000000101 4.0 1970-01-01 00:00:00.000000101 4.0 4.000000 1970-01-01 00:00:00.000000101 4.0 0.000000 1970-01-01 00:00:00.000000100
2 3 1970-01-01 00:00:00.000000200 2.0 1970-01-01 00:00:00.000000300 1970-01-01 00:00:00.000000299 35.0 1970-01-01 00:00:00.000000299 35.0 12.666667 1970-01-01 00:00:00.000000201 1.0 15.797327 1970-01-01 00:00:00.000000200
3 2 1970-01-01 00:00:00.000000300 16.0 1970-01-01 00:00:00.000000400 1970-01-01 00:00:00.000000350 32.0 1970-01-01 00:00:00.000000350 32.0 24.000000 1970-01-01 00:00:00.000000300 16.0 8.000000 1970-01-01 00:00:00.000000300
4 1 1970-01-01 00:00:00.000001000 64.0 1970-01-01 00:00:00.000001100 1970-01-01 00:00:00.000001000 64.0 1970-01-01 00:00:00.000001000 64.0 64.000000 1970-01-01 00:00:00.000001000 64.0 0.000000 1970-01-01 00:00:00.000001000
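
The count, mean and standard_deviation columns of the two time-based outputs above can be reproduced with the standard library alone. The sketch below rests on assumptions inferred from those outputs, not on documented behavior: windows are aligned to multiples of the width, a window includes its start and excludes its end, windows without points are omitted, and the standard deviation is the population standard deviation. The function name `window_stats` is hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev

# The example series as (timestamp_ns, value) pairs.
series = [
    (1, 8.0), (101, 4.0), (200, 2.0), (201, 1.0),
    (299, 35.0), (300, 16.0), (350, 32.0), (1000, 64.0),
]


def window_stats(points, width_ns):
    """Hypothetical re-implementation of width-based windowing."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Assumption: each point belongs to the window [start, start + width).
        start = (timestamp // width_ns) * width_ns
        buckets[start].append(value)
    rows = []
    for start in sorted(buckets):  # empty windows never appear in buckets
        values = buckets[start]
        rows.append({
            "start_timestamp": start,
            "end_timestamp": start + width_ns,
            "count": len(values),
            "mean": fmean(values),
            # Assumption: population (not sample) standard deviation.
            "standard_deviation": pstdev(values),
        })
    return rows


rows = window_stats(series, 100)
for row in rows:
    print(row)
```

Under these assumptions the sketch yields five windows with counts 1, 1, 3, 2 and 1, matching the outputs above, including a standard deviation of about 15.797327 for the window starting at 200 ns.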
>>> stats_fixed_window_count = F.statistics(include_std_dev=True, window_count=3)(series)
>>> stats_fixed_window_count.to_pandas()
count earliest_point.timestamp earliest_point.value end_timestamp largest_point.timestamp largest_point.value latest_point.timestamp latest_point.value mean smallest_point.timestamp smallest_point.value standard_deviation start_timestamp
0 6 1970-01-01 00:00:00.000000001 8.0 1970-01-01 00:00:00.000000335 1970-01-01 00:00:00.000000299 35.0 1970-01-01 00:00:00.000000300 16.0 11.0 1970-01-01 00:00:00.000000201 1.0 11.83216 1970-01-01 00:00:00.000000001
1 1 1970-01-01 00:00:00.000000350 32.0 1970-01-01 00:00:00.000000669 1970-01-01 00:00:00.000000350 32.0 1970-01-01 00:00:00.000000350 32.0 32.0 1970-01-01 00:00:00.000000350 32.0 0.00000 1970-01-01 00:00:00.000000335
2 1 1970-01-01 00:00:00.000001000 64.0 1970-01-01 00:00:00.000001003 1970-01-01 00:00:00.000001000 64.0 1970-01-01 00:00:00.000001000 64.0 64.0 1970-01-01 00:00:00.000001000 64.0 0.00000 1970-01-01 00:00:00.000000669
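
The fixed-count output above can be checked the same way. In this sketch the window width of 334 ns is read off the start_timestamp and end_timestamp columns of that output rather than derived from a formula, and the assumptions that windows start at the first point's timestamp and exclude their end are likewise inferred from the output. All variable names are illustrative.

```python
from statistics import fmean, pstdev

# The example series as (timestamp_ns, value) pairs.
series = [
    (1, 8.0), (101, 4.0), (200, 2.0), (201, 1.0),
    (299, 35.0), (300, 16.0), (350, 32.0), (1000, 64.0),
]

window_count = 3
first_timestamp = series[0][0]
# Assumption: width copied from the example output (335 - 1 = 334 ns),
# not computed here.
width_ns = 334

# Window edges: 1, 335, 669, 1003.
edges = [first_timestamp + i * width_ns for i in range(window_count + 1)]

rows = []
for start, end in zip(edges, edges[1:]):
    # Assumption: a window includes its start and excludes its end.
    values = [value for timestamp, value in series if start <= timestamp < end]
    rows.append({
        "start_timestamp": start,
        "end_timestamp": end,
        "count": len(values),
        "mean": fmean(values),
        "standard_deviation": pstdev(values),  # population standard deviation
    })

for row in rows:
    print(row)
```

With these edges the first window holds six points with mean 11.0 and a standard deviation of about 11.83216, and the remaining two windows hold one point each, in line with the output above.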