
foundryts.functions.statistics

foundryts.functions.statistics(start=None, end=None, window=None, **kwargs)

Returns a function that will partition a single series into windows and compute statistics over each window.

The series is partitioned into windows using the window argument, and each window yields the statistics listed in the Dataframe schema. The statistics are calculated over rolling windows that are created in one of two ways: from a periodicity (the width of each window), or from a fixed number of windows, where the width is calculated as round(total number of points / number of buckets). Which option is used depends on whether the window or the window_count argument is passed.
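
To make the periodicity-based option concrete, here is a minimal pure-Python sketch of fixed-width windowing. It does not call foundryts. The helper `fixed_width_windows` is hypothetical, and its alignment rule (half-open windows starting at multiples of the width, with empty windows skipped) is an assumption inferred from the output in the Examples section, not taken from the library.

```python
from collections import defaultdict

# Sample points (timestamp in ns, value), copied from the Examples section.
POINTS = [(1, 8.0), (101, 4.0), (200, 2.0), (201, 1.0),
          (299, 35.0), (300, 16.0), (350, 32.0), (1000, 64.0)]


def fixed_width_windows(points, width):
    """Group points into half-open windows [k*width, (k+1)*width).

    Assumption: windows are aligned to multiples of `width` and empty
    windows are skipped. This matches the documented example output but
    is not a statement about how foundryts is implemented.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts // width].append(value)

    rows = []
    for index in sorted(buckets):
        values = buckets[index]
        rows.append({
            "start_timestamp": index * width,
            "end_timestamp": (index + 1) * width,
            "count": len(values),
            "mean": sum(values) / len(values),
        })
    return rows


rows = fixed_width_windows(POINTS, 100)
for row in rows:
    print(row)
```

With a width of 100, the sketch yields five non-empty windows whose counts and means agree with the `window="100ns"` example.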

  • Parameters:
    • start (Union[int, datetime, str], optional) – Timestamp (inclusive) to start partitioning windows from the provided series. (default is the entire series)
    • end (Union[int, datetime, str], optional) – Timestamp (inclusive) to end partitioning windows from the provided series. (default is the entire series)
    • window (Union[int, datetime, str], optional) – The timedelta that sets the width of each window; this width is used to divide the series into a number of windows. (default is the entire series)
    • **kwargs – Flags for determining the window behavior and the output type.
  • Keyword Arguments:
    • include_std_dev (bool, False) – If set to True, the output will include the standard deviation.
    • window_count (int, optional) – Number of windows to compute the statistics over (instead of the size of each window).
  • Returns: A function that accepts a single series as input and partitions it into windows, providing statistics for each window.
  • Return type: (FunctionNode) -> SummarizerNode

Dataframe schema

Column name               Type      Description
count                     int       Number of data points in the window of the input series.
earliest_point.timestamp  datetime  Timestamp of the first data point in the window of the input series.
earliest_point.value      float     Value of the first data point in the window of the input series.
end_timestamp             datetime  End of the window. In the Examples section this is the window boundary, not the timestamp of the last data point.
largest_point.timestamp   datetime  Timestamp of the data point with the largest value in the window of the input series.
largest_point.value       float     Largest value in the window of the input series.
latest_point.timestamp    datetime  Timestamp of the most recent data point in the window of the input series.
latest_point.value        float     Value of the most recent data point in the window of the input series.
mean                      float     Average value of all data points in the window of the input series.
smallest_point.timestamp  datetime  Timestamp of the data point with the smallest value in the window of the input series.
smallest_point.value      float     Smallest value in the window of the input series.
start_timestamp           datetime  Start of the window. In the Examples section this is the window boundary, which can precede the first data point.

:::callout{theme="success" title="See Also"} distribution(), scatter() :::

Notes

This function is only applicable to numeric series.

In the future, the include_std_dev kwarg will be deprecated as this feature will be made the default.
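
The standard_deviation values printed in the Examples section are consistent with a population standard deviation (dividing by the number of points) rather than a sample one. The check below uses only Python's standard library with values copied from two of the example windows. It is an inference from the printed output, not a statement about the library's internals.

```python
import statistics

# Values from the example window starting at 200 ns (points at 200, 201, 299).
window_a = [2.0, 1.0, 35.0]
# Values from the example window starting at 300 ns (points at 300, 350).
window_b = [16.0, 32.0]

# Population standard deviation: divides by n.
pop_a = statistics.pstdev(window_a)
pop_b = statistics.pstdev(window_b)

# Sample standard deviation: divides by n - 1, shown only for contrast.
sample_a = statistics.stdev(window_a)

print(f"population std dev, window a: {pop_a:.6f}")
print(f"population std dev, window b: {pop_b:.6f}")
print(f"sample std dev, window a:     {sample_a:.6f}")
```

The population figures agree with the documented 15.797327 and 8.000000, while the sample figure for the first window is noticeably larger.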

window_count can only be used together with include_std_dev, in which case it overrides window. If window_count is passed without include_std_dev, it is ignored.
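
The window boundaries printed for `window_count=3` in the Examples section can be reproduced with a small pure-Python sketch. The helper `count_based_windows` is hypothetical, and its width formula (the inclusive time span divided by the number of windows, rounded up) is only one rule that happens to match the printed output. Treat it as an assumption, not as the library's documented algorithm.

```python
import math

# Timestamps (ns) copied from the Examples section.
TIMESTAMPS = [1, 101, 200, 201, 299, 300, 350, 1000]


def count_based_windows(timestamps, window_count):
    """Split the time range into `window_count` equal half-open windows.

    Assumption: width = ceil((last - first + 1) / window_count). This
    reproduces the example output but is inferred, not documented.
    """
    first, last = min(timestamps), max(timestamps)
    width = math.ceil((last - first + 1) / window_count)

    windows = []
    for i in range(window_count):
        start = first + i * width
        end = start + width
        count = sum(1 for ts in timestamps if start <= ts < end)
        windows.append((start, end, count))
    return windows


windows = count_based_windows(TIMESTAMPS, 3)
for start, end, count in windows:
    print(start, end, count)
```

Under this assumption the three windows span 1 to 335, 335 to 669, and 669 to 1003, holding 6, 1, and 1 points, which lines up with the start_timestamp, end_timestamp, and count columns of that example.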

Examples

>>> import foundryts.functions as F  # assumed alias; F is not defined in the original example
>>> series = F.points(
...     (1, 8.0),
...     (101, 4.0),
...     (200, 2.0),
...     (201, 1.0),
...     (299, 35.0),
...     (300, 16.0),
...     (350, 32.0),
...     (1000, 64.0),
... )
>>> series.to_pandas()
                      timestamp  value
0 1970-01-01 00:00:00.000000001    8.0
1 1970-01-01 00:00:00.000000101    4.0
2 1970-01-01 00:00:00.000000200    2.0
3 1970-01-01 00:00:00.000000201    1.0
4 1970-01-01 00:00:00.000000299   35.0
5 1970-01-01 00:00:00.000000300   16.0
6 1970-01-01 00:00:00.000000350   32.0
7 1970-01-01 00:00:00.000001000   64.0
>>> stats = F.statistics(window="100ns")(series) # use time-based window
>>> stats.to_pandas()
   count      earliest_point.timestamp  earliest_point.value                 end_timestamp       largest_point.timestamp  largest_point.value        latest_point.timestamp  latest_point.value       mean      smallest_point.timestamp  smallest_point.value               start_timestamp
0      1 1970-01-01 00:00:00.000000001                   8.0 1970-01-01 00:00:00.000000100 1970-01-01 00:00:00.000000001                  8.0 1970-01-01 00:00:00.000000001                 8.0   8.000000 1970-01-01 00:00:00.000000001                   8.0 1970-01-01 00:00:00.000000000
1      1 1970-01-01 00:00:00.000000101                   4.0 1970-01-01 00:00:00.000000200 1970-01-01 00:00:00.000000101                  4.0 1970-01-01 00:00:00.000000101                 4.0   4.000000 1970-01-01 00:00:00.000000101                   4.0 1970-01-01 00:00:00.000000100
2      3 1970-01-01 00:00:00.000000200                   2.0 1970-01-01 00:00:00.000000300 1970-01-01 00:00:00.000000299                 35.0 1970-01-01 00:00:00.000000299                35.0  12.666667 1970-01-01 00:00:00.000000201                   1.0 1970-01-01 00:00:00.000000200
3      2 1970-01-01 00:00:00.000000300                  16.0 1970-01-01 00:00:00.000000400 1970-01-01 00:00:00.000000350                 32.0 1970-01-01 00:00:00.000000350                32.0  24.000000 1970-01-01 00:00:00.000000300                  16.0 1970-01-01 00:00:00.000000300
4      1 1970-01-01 00:00:00.000001000                  64.0 1970-01-01 00:00:00.000001100 1970-01-01 00:00:00.000001000                 64.0 1970-01-01 00:00:00.000001000                64.0  64.000000 1970-01-01 00:00:00.000001000                  64.0 1970-01-01 00:00:00.000001000
>>> stats_with_std_dev = F.statistics(window="100ns", include_std_dev=True)(series)
>>> stats_with_std_dev.to_pandas()
   count      earliest_point.timestamp  earliest_point.value                 end_timestamp       largest_point.timestamp  largest_point.value        latest_point.timestamp  latest_point.value       mean      smallest_point.timestamp  smallest_point.value  standard_deviation               start_timestamp
0      1 1970-01-01 00:00:00.000000001                   8.0 1970-01-01 00:00:00.000000100 1970-01-01 00:00:00.000000001                  8.0 1970-01-01 00:00:00.000000001                 8.0   8.000000 1970-01-01 00:00:00.000000001                   8.0            0.000000 1970-01-01 00:00:00.000000000
1      1 1970-01-01 00:00:00.000000101                   4.0 1970-01-01 00:00:00.000000200 1970-01-01 00:00:00.000000101                  4.0 1970-01-01 00:00:00.000000101                 4.0   4.000000 1970-01-01 00:00:00.000000101                   4.0            0.000000 1970-01-01 00:00:00.000000100
2      3 1970-01-01 00:00:00.000000200                   2.0 1970-01-01 00:00:00.000000300 1970-01-01 00:00:00.000000299                 35.0 1970-01-01 00:00:00.000000299                35.0  12.666667 1970-01-01 00:00:00.000000201                   1.0           15.797327 1970-01-01 00:00:00.000000200
3      2 1970-01-01 00:00:00.000000300                  16.0 1970-01-01 00:00:00.000000400 1970-01-01 00:00:00.000000350                 32.0 1970-01-01 00:00:00.000000350                32.0  24.000000 1970-01-01 00:00:00.000000300                  16.0            8.000000 1970-01-01 00:00:00.000000300
4      1 1970-01-01 00:00:00.000001000                  64.0 1970-01-01 00:00:00.000001100 1970-01-01 00:00:00.000001000                 64.0 1970-01-01 00:00:00.000001000                64.0  64.000000 1970-01-01 00:00:00.000001000                  64.0            0.000000 1970-01-01 00:00:00.000001000
>>> stats_fixed_window_count = F.statistics(include_std_dev=True, window_count=3)(series)
>>> stats_fixed_window_count.to_pandas()
   count      earliest_point.timestamp  earliest_point.value                 end_timestamp       largest_point.timestamp  largest_point.value        latest_point.timestamp  latest_point.value  mean      smallest_point.timestamp  smallest_point.value  standard_deviation               start_timestamp
0      6 1970-01-01 00:00:00.000000001                   8.0 1970-01-01 00:00:00.000000335 1970-01-01 00:00:00.000000299                 35.0 1970-01-01 00:00:00.000000300                16.0  11.0 1970-01-01 00:00:00.000000201                   1.0            11.83216 1970-01-01 00:00:00.000000001
1      1 1970-01-01 00:00:00.000000350                  32.0 1970-01-01 00:00:00.000000669 1970-01-01 00:00:00.000000350                 32.0 1970-01-01 00:00:00.000000350                32.0  32.0 1970-01-01 00:00:00.000000350                  32.0             0.00000 1970-01-01 00:00:00.000000335
2      1 1970-01-01 00:00:00.000001000                  64.0 1970-01-01 00:00:00.000001003 1970-01-01 00:00:00.000001000                 64.0 1970-01-01 00:00:00.000001000                64.0  64.0 1970-01-01 00:00:00.000001000                  64.0             0.00000 1970-01-01 00:00:00.000000669
