跳转至

Aggregation and pivot tables(聚合与透视表)

Aggregation Syntax

There are a number of ways to produce aggregations in PySpark. We recommend this syntax as the most reliable.

aggregated_df = df.groupBy('state').agg(
    F.max('city_population').alias('largest_city_in_state')
)
state city city_population
WA Bellevue 100000
WA Seattle 700000
PA Philadelphia 1500000
PA Pittsburgh 300000
state largest_city_in_state
PA 1500000
WA 700000

Examples

The .groupBy() method on a DataFrame takes an arbitrary number of columns over which to perform the aggregations. Alternatively, to aggregate across the whole DataFrame, include no columns.

aggregated_df = df.groupBy('state', 'county').agg(
    F.max('city_population').alias('largest_city_in_state_county')
)

aggregated_df = df.groupBy().agg(
    F.max('city_population').alias('largest_city_overall')
)

The .agg() method on a grouped DataFrame takes an arbitrary number of aggregation functions.

aggregated_df = df.groupBy('state').agg(
    F.max('city_population').alias('largest_city_in_state'),
    F.avg('city_population').alias('average_population_in_state')
)

By default aggregations produce columns of the form aggregation_name(target_column). However, column names in Foundry cannot contain parentheses or other non-alphanumeric characters. Alias each aggregation to a specific name instead.

Pivot Tables

Pivot tables in PySpark work very similarly to ordinary grouped aggregations.

pivoted_df = df.groupBy('equipment').pivot('sensor').mean('value')
equipment sensor value
A temperature 60
A temperature 40
B speed 6
A speed 3
equipment temperature speed
A 50 3
B null 7

Aggregation Functions

Learn more about PySpark aggregate functions. ↗

avg(column) / mean(column)

collect_list(column)

  • Combine all values into an array

collect_set(column)

  • Combine all values into an array with duplicates removed

count(column)

corr(x, y)

  • Pearson Correlation Coefficient for columns x and y.

covar_pop(col1, col2)

covar_samp(col1, col2)

countDistinct(column, *cols)

first(column, ignorenulls=False)

  • First value of the column in the group. Useful for Pivot tables for which we expect only one value to exist but must choose an aggregation anyway.

grouping(column)

grouping_id(*cols)

kurtosis(column)

last(column, ignorenulls=False)

max(column)

min(column)

skewness(column)

stddev(column)

stddev_pop(column)

  • Population standard deviation

stddev_samp(column)

  • Unbiased sample standard deviation

sum(column)

sumDistinct(column)

var_pop(column)

  • Population variance

var_samp(column)

  • Unbiased sample variance

variance(column)


中文翻译

聚合与透视表

聚合语法

在 PySpark 中有多种生成聚合结果的方式。我们推荐使用以下语法,因为它最为可靠。

aggregated_df = df.groupBy('state').agg(
    F.max('city_population').alias('largest_city_in_state')
)
state city city_population
WA Bellevue 100000
WA Seattle 700000
PA Philadelphia 1500000
PA Pittsburgh 300000
state largest_city_in_state
PA 1500000
WA 700000

示例

DataFrame 上的 .groupBy() 方法接受任意数量的列作为聚合依据。如果要对整个 DataFrame 进行聚合,则不传入任何列。

aggregated_df = df.groupBy('state', 'county').agg(
    F.max('city_population').alias('largest_city_in_state_county')
)

aggregated_df = df.groupBy().agg(
    F.max('city_population').alias('largest_city_overall')
)

分组后的 DataFrame 上的 .agg() 方法接受任意数量的聚合函数。

aggregated_df = df.groupBy('state').agg(
    F.max('city_population').alias('largest_city_in_state'),
    F.avg('city_population').alias('average_population_in_state')
)

默认情况下,聚合结果会生成形如 aggregation_name(target_column) 的列名。然而,在 Foundry 中,列名不能包含括号或其他非字母数字字符。因此,请为每个聚合结果指定别名。

透视表

PySpark 中的透视表与普通的分组聚合非常相似。

pivoted_df = df.groupBy('equipment').pivot('sensor').mean('value')
equipment sensor value
A temperature 60
A temperature 40
B speed 6
A speed 3
equipment temperature speed
A 50 3
B null 7

聚合函数

了解更多 PySpark 聚合函数 ↗

avg(column) / mean(column)

collect_list(column)

  • 将所有值合并为一个数组

collect_set(column)

  • 将所有值合并为一个数组,并去除重复项

count(column)

corr(x, y)

  • xy 的皮尔逊相关系数

covar_pop(col1, col2)

covar_samp(col1, col2)

countDistinct(column, *cols)

first(column, ignorenulls=False)

  • 分组中该列的第一个值。适用于透视表,当预期只有一个值存在但必须选择某种聚合方式时使用。

grouping(column)

grouping_id(*cols)

kurtosis(column)

last(column, ignorenulls=False)

max(column)

min(column)

skewness(column)

stddev(column)

stddev_pop(column)

  • 总体标准差

stddev_samp(column)

  • 无偏样本标准差

sum(column)

sumDistinct(column)

var_pop(column)

  • 总体方差

var_samp(column)

  • 无偏样本方差

variance(column)