foundryts.functions.distribution

foundryts.functions.distribution(start=None, end=None, start_value=None, end_value=None, bins=None)

Returns a function that will evaluate the distribution of one or more time-series.

A distribution is a breakdown of points into bins that partition the requested range of values. Evaluating the distribution returns a list of bins, each giving the start and end of its value range and the number of points that fall within it.

The distribution can be applied to a single series or to multiple series. For multiple series, each bin in the final dataframe counts the points from the union of all input series.

Every bin covers a value range of the same width, delta, calculated as (max value - min value) / (number of bins).
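The binning arithmetic above can be sketched in plain Python. This is an illustration only and does not call foundryts: `equal_width_bins` is a hypothetical helper, and counting a value equal to the upper bound in the last bin is an assumption chosen to match the example output further down this page.

```python
def equal_width_bins(values, bins=10, start_value=None, end_value=None):
    """Hypothetical helper: count values into equal-width bins."""
    lo = min(values) if start_value is None else start_value
    hi = max(values) if end_value is None else end_value
    if hi == lo:
        raise ValueError("value range is empty, so delta would be zero")
    delta = (hi - lo) / bins  # constant width shared by every bin
    counts = [0] * bins
    for v in values:
        if v < lo or v > hi:
            continue  # outside the requested value range
        # Assumption: a value equal to the upper bound goes into the last bin.
        idx = min(int((v - lo) / delta), bins - 1)
        counts[idx] += 1
    return [
        {"start": lo + i * delta, "end": lo + (i + 1) * delta, "count": c}
        for i, c in enumerate(counts)
    ]


series_1_values = [0.0, 10.2, 11.3, 11.1, 11.2, 12.0, 11.7, 16.0, 11.8]
series_2_values = [0.5, 0.2, 1.3, 0.1, 1.2, 1.4, 1.0, 2.0, 1.0]

# Single series: delta = (16.0 - 0.0) / 3
single = equal_width_bins(series_1_values, bins=3)
print([row["count"] for row in single])  # [1, 1, 7]

# Multiple series: bin the union of values from all series
pooled = equal_width_bins(series_1_values + series_2_values, bins=3)
print([row["count"] for row in pooled])  # [10, 1, 7]
```

Pooling the values before binning is what makes the bin edges identical in the single-series and multi-series cases here: both inputs share the same minimum (0.0) and maximum (16.0), so only the counts change.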

  • Parameters:
  • start (Union[int, datetime, str], optional) – Timestamp (inclusive) at which to start evaluating the distribution over the provided series (default is the earliest timestamp in any of the input time series).
  • end (Union[int, datetime, str], optional) – Timestamp (exclusive) at which to stop evaluating the distribution over the provided series (default is the latest timestamp in any of the input time series).
  • start_value (float, optional) – Lower bound (inclusive) of the value range to evaluate the distribution over (default is the minimum value of any of the input time series).
  • end_value (float, optional) – Upper bound (exclusive) of the value range to evaluate the distribution over (default is the maximum value of any of the input time series).
  • bins (int, optional) – Number of value bins to distribute points over (default is 10).
  • Returns: A function that accepts one or more series as inputs and generates the distribution over all points in the specified or default number of bins.
  • Return type: (Union[FunctionNode, NodeCollection]) -> SummarizerNode
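The inclusive `start` and exclusive `end` semantics can be illustrated without foundryts. In this minimal sketch, `points_in_window` is a hypothetical helper, timestamps are the plain integers used in the examples on this page, and treating `None` as "no bound" is an assumption rather than documented behavior.

```python
def points_in_window(points, start=None, end=None):
    """Hypothetical helper: keep (timestamp, value) points with start <= t < end."""
    return [
        (t, v)
        for t, v in points
        if (start is None or t >= start) and (end is None or t < end)
    ]


series_1_points = [
    (1, 0.0), (101, 10.2), (200, 11.3), (201, 11.1), (299, 11.2),
    (300, 12.0), (400, 11.7), (500, 16.0), (123450, 11.8),
]

# The point at timestamp 101 is kept (start is inclusive);
# the point at timestamp 300 is dropped (end is exclusive).
window = points_in_window(series_1_points, start=101, end=300)
print(window)  # [(101, 10.2), (200, 11.3), (201, 11.1), (299, 11.2)]
```

Only the points that survive this time filter would then be counted into value bins.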

Dataframe schema

| Column name | Type | Description |
| --- | --- | --- |
| start_timestamp | datetime | Start time of the distribution (inclusive) |
| end_timestamp | datetime | End time of the distribution (exclusive) |
| start | float | Lower bound of values (inclusive) |
| end | float | Upper bound of values (exclusive) |
| delta | float | The difference between the min and max values of each bin. Given how bins are calculated, delta is fixed for all bins. |
| distribution_values.start | float | Start value of a distribution bin |
| distribution_values.end | float | End value of a distribution bin |
| distribution_values.count | int | Number of instances in a distribution bin |

:::callout{theme="success" title="See Also"} statistics(), scatter() :::

:::callout{theme="warning" title="Note"} This function is only applicable to numeric series. :::

Examples

>>> import foundryts.functions as F
>>> from foundryts import NodeCollection
>>> series_1 = F.points(
...     (1, 0.0),
...     (101, 10.2),
...     (200, 11.3),
...     (201, 11.1),
...     (299, 11.2),
...     (300, 12.0),
...     (400, 11.7),
...     (500, 16.0),
...     (123450, 11.8),
...     name="series-1",
... )
>>> series_2 = F.points(
...     (1, 0.5),
...     (101, 0.2),
...     (200, 1.3),
...     (201, 0.1),
...     (299, 1.2),
...     (300, 1.4),
...     (400, 1.0),
...     (500, 2.0),
...     (123450, 1.0),
...     name="series-2",
... )
>>> series_1.to_pandas()
                      timestamp  value
0 1970-01-01 00:00:00.000000001    0.0
1 1970-01-01 00:00:00.000000101   10.2
2 1970-01-01 00:00:00.000000200   11.3
3 1970-01-01 00:00:00.000000201   11.1
4 1970-01-01 00:00:00.000000299   11.2
5 1970-01-01 00:00:00.000000300   12.0
6 1970-01-01 00:00:00.000000400   11.7
7 1970-01-01 00:00:00.000000500   16.0
8 1970-01-01 00:00:00.000123450   11.8
>>> series_2.to_pandas()
                      timestamp  value
0 1970-01-01 00:00:00.000000001    0.5
1 1970-01-01 00:00:00.000000101    0.2
2 1970-01-01 00:00:00.000000200    1.3
3 1970-01-01 00:00:00.000000201    0.1
4 1970-01-01 00:00:00.000000299    1.2
5 1970-01-01 00:00:00.000000300    1.4
6 1970-01-01 00:00:00.000000400    1.0
7 1970-01-01 00:00:00.000000500    2.0
8 1970-01-01 00:00:00.000123450    1.0
>>> nc = NodeCollection(series_1, series_2)
>>> single_dist = F.distribution(bins=3)(series_1)  # single series distribution
>>> single_dist.to_pandas()
      delta  distribution_values.count  distribution_values.end  distribution_values.start   end end_timestamp  start               start_timestamp
0  5.333333                          1                 5.333333                   0.000000  16.0    2262-01-01    0.0 1677-09-21 00:12:43.145225216
1  5.333333                          1                10.666667                   5.333333  16.0    2262-01-01    0.0 1677-09-21 00:12:43.145225216
2  5.333333                          7                16.000000                  10.666667  16.0    2262-01-01    0.0 1677-09-21 00:12:43.145225216
>>> multiple_dist = F.distribution(bins=3)(nc)  # multiple series distribution
>>> multiple_dist.to_pandas()
      delta  distribution_values.count  distribution_values.end  distribution_values.start   end end_timestamp  start               start_timestamp
0  5.333333                         10                 5.333333                   0.000000  16.0    2262-01-01    0.0 1677-09-21 00:12:43.145225216
1  5.333333                          1                10.666667                   5.333333  16.0    2262-01-01    0.0 1677-09-21 00:12:43.145225216
2  5.333333                          7                16.000000                  10.666667  16.0    2262-01-01    0.0 1677-09-21 00:12:43.145225216
