Aggregation and pivot tables(聚合与透视表)¶
Aggregation Syntax¶
There are a number of ways to produce aggregations in PySpark. We recommend this syntax as the most reliable.
aggregated_df = df.groupBy('state').agg(
F.max('city_population').alias('largest_city_in_state')
)
| state | city | city_population |
|---|---|---|
| WA | Bellevue | 100000 |
| WA | Seattle | 700000 |
| PA | Philadelphia | 1500000 |
| PA | Pittsburgh | 300000 |
| state | largest_city_in_state |
|---|---|
| PA | 1500000 |
| WA | 700000 |
Examples¶
The .groupBy() method on a DataFrame takes an arbitrary number of columns over which to perform the aggregations. Alternatively, to aggregate across the whole DataFrame, include no columns.
aggregated_df = df.groupBy('state', 'county').agg(
F.max('city_population').alias('largest_city_in_state_county')
)
aggregated_df = df.groupBy().agg(
F.max('city_population').alias('largest_city_overall')
)
The .agg() method on a grouped DataFrame takes an arbitrary number of aggregation functions.
aggregated_df = df.groupBy('state').agg(
F.max('city_population').alias('largest_city_in_state'),
F.avg('city_population').alias('average_population_in_state')
)
By default aggregations produce columns of the form
aggregation_name(target_column). However, column names in Foundry cannot contain parentheses or other non-alphanumeric characters. Alias each aggregation to a specific name instead.
Pivot Tables¶
Pivot tables in PySpark work very similarly to ordinary grouped aggregations.
pivoted_df = df.groupBy('equipment').pivot('sensor').mean('value')
| equipment | sensor | value |
|---|---|---|
| A | temperature | 60 |
| A | temperature | 40 |
| B | speed | 6 |
| A | speed | 3 |
| equipment | temperature | speed |
|---|---|---|
| A | 50 | 3 |
| B | null | 7 |
Aggregation Functions¶
Learn more about PySpark aggregate functions. ↗
avg(column) / mean(column)¶
collect_list(column)¶
- Combine all values into an array
collect_set(column)¶
- Combine all values into an array with duplicates removed
count(column)¶
corr(x, y)¶
- Pearson Correlation Coefficient for columns
xandy.
covar_pop(col1, col2)¶
covar_samp(col1, col2)¶
countDistinct(column, *cols)¶
first(column, ignorenulls=False)¶
- First value of the column in the group. Useful for Pivot tables for which we expect only one value to exist but must choose an aggregation anyway.
grouping(column)¶
grouping_id(*cols)¶
kurtosis(column)¶
last(column, ignorenulls=False)¶
max(column)¶
min(column)¶
skewness(column)¶
stddev(column)¶
stddev_pop(column)¶
- Population standard deviation
stddev_samp(column)¶
- Unbiased sample standard deviation
sum(column)¶
sumDistinct(column)¶
var_pop(column)¶
- Population variance
var_samp(column)¶
- Unbiased sample variance
variance(column)¶
中文翻译¶
聚合与透视表¶
聚合语法¶
在 PySpark 中有多种生成聚合结果的方式。我们推荐使用以下语法,因为它最为可靠。
aggregated_df = df.groupBy('state').agg(
F.max('city_population').alias('largest_city_in_state')
)
| state | city | city_population |
|---|---|---|
| WA | Bellevue | 100000 |
| WA | Seattle | 700000 |
| PA | Philadelphia | 1500000 |
| PA | Pittsburgh | 300000 |
| state | largest_city_in_state |
|---|---|
| PA | 1500000 |
| WA | 700000 |
示例¶
DataFrame 上的 .groupBy() 方法接受任意数量的列作为聚合依据。如果要对整个 DataFrame 进行聚合,则不传入任何列。
aggregated_df = df.groupBy('state', 'county').agg(
F.max('city_population').alias('largest_city_in_state_county')
)
aggregated_df = df.groupBy().agg(
F.max('city_population').alias('largest_city_overall')
)
分组后的 DataFrame 上的 .agg() 方法接受任意数量的聚合函数。
aggregated_df = df.groupBy('state').agg(
F.max('city_population').alias('largest_city_in_state'),
F.avg('city_population').alias('average_population_in_state')
)
默认情况下,聚合结果会生成形如
aggregation_name(target_column)的列名。然而,在 Foundry 中,列名不能包含括号或其他非字母数字字符。因此,请为每个聚合结果指定别名。
透视表¶
PySpark 中的透视表与普通的分组聚合非常相似。
pivoted_df = df.groupBy('equipment').pivot('sensor').mean('value')
| equipment | sensor | value |
|---|---|---|
| A | temperature | 60 |
| A | temperature | 40 |
| B | speed | 6 |
| A | speed | 3 |
| equipment | temperature | speed |
|---|---|---|
| A | 50 | 3 |
| B | null | 7 |
聚合函数¶
avg(column) / mean(column)¶
collect_list(column)¶
- 将所有值合并为一个数组
collect_set(column)¶
- 将所有值合并为一个数组,并去除重复项
count(column)¶
corr(x, y)¶
- 列
x和y的皮尔逊相关系数
covar_pop(col1, col2)¶
covar_samp(col1, col2)¶
countDistinct(column, *cols)¶
first(column, ignorenulls=False)¶
- 分组中该列的第一个值。适用于透视表,当预期只有一个值存在但必须选择某种聚合方式时使用。
grouping(column)¶
grouping_id(*cols)¶
kurtosis(column)¶
last(column, ignorenulls=False)¶
max(column)¶
min(column)¶
skewness(column)¶
stddev(column)¶
stddev_pop(column)¶
- 总体标准差
stddev_samp(column)¶
- 无偏样本标准差
sum(column)¶
sumDistinct(column)¶
var_pop(column)¶
- 总体方差
var_samp(column)¶
- 无偏样本方差