跳转至

Filtering(过滤(Filtering))

Filter, where

DataFrame.filter(expression)

Returns a new DataFrame with a subset of rows determined by the boolean expression. The expression parameter is a boolean column expression that can be derived in various ways.

:::callout{theme="neutral"} filter in the beginning of a transform rather than towards the end to reduce unnecessary computation work and increase build time performance. :::

If your dataset contains a lot of entries, but you only care about a subset of the rows that exist based on a criteria.

df = df.filter(F.col("age") >= 21) # limit dataset to only people over age 21

where is an alias of filter and does exactly the same thing (the choice is up to your or your team's preference for which method name is more clear to read):

df = df.where(F.col("age") >= 21) # limit dataset to only people over age 21

You can also chain filters in a few different ways:

df = df.filter(F.col("age") >= 21).filter(F.col("age") < 35) # or...
df = df.filter((F.col("age") >= 21) & (F.col("age") < 35)) # group comparisons using parenthesis to ensure correct order of evaluation.

Constants (Literals)

Whenever you compare a column to a constant, or "literal", such as a single hard coded string, date, or number, PySpark actually evaluates this basic Python datatype into a "literal" (same thing as declaring F.lit(value)). A literal is simply a column expression with a static value. This is an important dinstinction to know before we move forward, because it implies that all comparisons with literals (whether implicit or explicit) can be substituted with a named column instead. This means you can easily make dynamic comparisons that rely on other columns in the same row.

In some contexts literals are not interpreted correctly. For example, when comparing to a string it may be ambiguous whether you intend to refer to a column named the string or the string itself.

df = df.filter("X" == "Y") # Do X and Y refer to columns or literals?
dff = df.filter(F.col("X") == F.lit("Y")) # Unambiguous.

Logical operations

PySpark has a number of binary logical operations. These are always evaluated into instances of the boolean column expression and can be used to combine conditions.

:::callout{theme="neutral"} Because the arguments to our logical operators are whole columns, not Python primitives, we cannot use the familiar and or or Python operators. These operators expect both arguments to already evaluate to single booleans. PySpark is able to interpret the & (bitwise-and), | (bitwise-or) operators, and ~ negation (tilda) symbols to build up a SQL query that runs very efficiently on all rows. :::

One way this helps us is it permits named variables that reference the value of binary operations. So, in order to increase readability and clarity, especially when you are filtering for several properties, you can descriptively name each comparison:

# filter for records where users are aged [21, 35), or has first_name="John", or last_name=None (null)
at_least_21 = F.col("age") >= 21
younger_than_35 = F.col("age") < 35
within_age_range = at_least_21 & younger_than_35
name_is_john = F.col("first_name") == "John"
last_name_is_null = F.isnull(F.col("last_name"))
df = df.where(within_age_range | name_is_john | last_name_is_null)
return df

Another way this helps us is we can leverage logical operations to provide filtering logic:

  • &: And
df = df.filter(condition_a & condition_b)
  • |: Or
df = df.filter(condition_a | condition_b)
  • ^: Exclusive-or
df = df.filter(condition_a ^ condition_b)
  • ~: Negation
df = df.filter(~condition_a)

Feel free to use python's for-loops to generate conditionals, but be sure to brush up on Boolean algebra ↗ to avoid unnecessary error. Here's an example of how this can be leveraged:

def custom_func(col, value):
    # basic logic or UDF
    return output # True/False

values = ["a", "b", "c", "d"]
condition = F.lit(False)
for x in values:
    condition = condition | custom_func(F.col("column"), x)
df = df.filter(condition)
return df

Like, rlike

The like and rlike methods allow pattern-matching using, respectively, SQL LIKE and Regex syntaxes.

  • For simple substring search, use like.
  • For more complex pattern-matching, use rlike.

Column.like(sql_like)

Returns a boolean column based on a SQL LIKE match, provided by a literal string or column:

df = df.filter(F.col('name').like('Al%'))
age name
2 "Alice"

SQL LIKE wildcards:

  • %: represents ≥ 0 characters
  • _: represents a single character

Examples (from w3schools ↗):

LIKE Operator Description
Column.like('a%') Finds any values that starts with "a"
Column.like('%a') Finds any values that ends with "a"
Column.like('%or%') Finds any values that have "or" in any position
Column.like('_r%') Finds any values that have "r" in the second position
Column.like('a_%_%') Finds any values that starts with "a" and ≥ 3 characters
Column.like('a%o') Finds any values that starts with "a" and ends with "o"

Column.rlike(regex)

Returns a boolean column expression based on a regex match, provided by a literal string or column:

df = df.filter(F.col('phone').rlike('[0-9]{3}(?:.+)?[0-9]{3}(?:.+)?[0-9]{4}'))
name phone
"Alice" "412-512-1234"
"John" "(555) 123-5123"
"Jane" "4121234444"

If leveraged properly, regex is very powerful. Here are some resources to help you get started:

Starts with, ends with, contains

Column.startswith(string)

Returns a boolean column expression indicating whether the column's string value starts with the string (literal, or other column) provided in the parameter.

df = df.filter(F.col("id").startswith("prefix-"))

Column.endswith(string)

Returns a boolean column expression indicating whether the column's string value ends with the string (literal, or other column) provided in the parameter.

df = df.filter(F.col("id").endswith("-suffix"))

Column.contains(string)

Returns a boolean column expression indicating whether the column's string value contains the string (literal, or other column) provided in the parameter.

df = df.filter(F.col("id").contains("string"))

Substring

Column.substr(startPos, length)

Return a string column expression that evaluates the substring of the column's value.

Parameters:

  • startPos - start position, counting from 1 (int or Column)
  • length - length of the substring (int or Column)

  • Creating a column of substrings

df = df.select(F.col("name").substr(1, 3).alias("col"))
col
"Ali"
"Bob"
  1. Filtering on a substring
df = df.filter(F.col("phone").substr(5, 3) == "555")
phone
"323-555-1234"
"897-555-4126"
...

Is in

Column.isin(*cols)

Returns a boolean expression that is evaluated to True if the value of the column is contained by the evaluated values of the arguments (in the form of an argument list, or an array, of Columns or literals).

df = df.filter(F.col("name").isin("Bob", "Mike"))
age name
5 "Bob"
... ...
df = df.filter(F.col("age").isin([1, 2, 3]))
age name
2 "Alice"
... ...

Between

Column.between(lowerBound, upperBound)

Returns a boolean expression that is evaluated to True if the value of the expression is between the lowerBound and upperBound literal or column (inclusive).

within_range = F.col("age").between(10, df.upperBound).alias("age_within_range")
df = df.select(df.name, df.upperBound, df.age, within_range)
name upperbound age age_within_range
"Taylor" 30 35 False
"Sally" 40 34 True
"Lucy" 28 28 True

中文翻译


过滤(Filtering)

过滤与条件筛选(Filter, where)

DataFrame.filter(expression)

返回一个新的DataFrame,其中包含由布尔表达式(Boolean expression)确定的行的子集。expression参数是一个布尔列表达式(Boolean column expression),可以通过多种方式派生。

:::callout{theme="neutral"} 建议在转换(Transform)的开头而非结尾使用filter,以减少不必要的计算工作并提升构建性能。 :::

当数据集包含大量条目,但你只关心基于某些条件存在的行的子集时,可以使用此方法。

df = df.filter(F.col("age") >= 21) # 将数据集限制为仅包含年龄超过21岁的人

wherefilter的别名,功能完全相同(选择哪种方法取决于你或团队对可读性的偏好):

df = df.where(F.col("age") >= 21) # 将数据集限制为仅包含年龄超过21岁的人

你还可以通过几种不同方式链式调用过滤器:

df = df.filter(F.col("age") >= 21).filter(F.col("age") < 35) # 或者...
df = df.filter((F.col("age") >= 21) & (F.col("age") < 35)) # 使用括号分组比较,确保正确的求值顺序。

常量(Constants/Literals)

当你将列与常量(即"字面量(Literal)",例如单个硬编码的字符串、日期或数字)进行比较时,PySpark实际上会将这个基本的Python数据类型求值为"字面量"(等同于声明F.lit(value))。字面量只是一个具有静态值的列表达式。在继续之前,了解这一重要区别很关键,因为它意味着所有与字面量的比较(无论是隐式还是显式)都可以替换为命名列。这意味着你可以轻松地创建依赖于同一行中其他列的动态比较。

在某些上下文中,字面量可能无法被正确解释。例如,当与字符串比较时,可能会产生歧义:你是指名为该字符串的列,还是该字符串本身。

df = df.filter("X" == "Y") # X和Y指的是列还是字面量?
dff = df.filter(F.col("X") == F.lit("Y")) # 无歧义。

逻辑运算(Logical operations)

PySpark提供了多种二元逻辑运算(Binary logical operations)。这些运算始终被求值为布尔列表达式的实例,可用于组合条件。

:::callout{theme="neutral"} 由于逻辑运算符的参数是整个列,而非Python基本类型,因此我们不能使用熟悉的andor Python运算符。这些运算符要求两个参数都已求值为单个布尔值。PySpark能够解释&(按位与)、|(按位或)运算符以及~(取反)符号,以构建一个能在所有行上高效运行的SQL查询。 :::

这样做的一个好处是,我们可以使用命名变量来引用二元运算的结果值。因此,为了提高可读性和清晰度,尤其是在过滤多个属性时,你可以为每个比较赋予描述性名称:

# 筛选年龄在[21, 35)之间,或first_name="John",或last_name为空(null)的记录
at_least_21 = F.col("age") >= 21
younger_than_35 = F.col("age") < 35
within_age_range = at_least_21 & younger_than_35
name_is_john = F.col("first_name") == "John"
last_name_is_null = F.isnull(F.col("last_name"))
df = df.where(within_age_range | name_is_john | last_name_is_null)
return df

另一个好处是,我们可以利用逻辑运算来提供过滤逻辑:

  • &:与(And)
df = df.filter(condition_a & condition_b)
  • |:或(Or)
df = df.filter(condition_a | condition_b)
  • ^:异或(Exclusive-or)
df = df.filter(condition_a ^ condition_b)
  • ~:取反(Negation)
df = df.filter(~condition_a)

你可以自由使用Python的for循环来生成条件,但请务必复习布尔代数(Boolean algebra) ↗以避免不必要的错误。以下是一个如何利用此功能的示例:

def custom_func(col, value):
    # 基本逻辑或UDF
    return output # True/False

values = ["a", "b", "c", "d"]
condition = F.lit(False)
for x in values:
    condition = condition | custom_func(F.col("column"), x)
df = df.filter(condition)
return df

模糊匹配(Like, rlike)

likerlike方法分别使用SQL LIKE和正则表达式(Regex)语法进行模式匹配。

  • 对于简单的子字符串搜索,请使用like
  • 对于更复杂的模式匹配,请使用rlike

Column.like(sql_like)

基于SQL LIKE匹配返回一个布尔列,由字面量字符串或列提供:

df = df.filter(F.col('name').like('Al%'))
age name
2 "Alice"

SQL LIKE通配符:

  • %:表示≥ 0个字符
  • _:表示单个字符

示例(来自w3schools ↗):

LIKE运算符 描述
Column.like('a%') 查找任何以"a"开头的值
Column.like('%a') 查找任何以"a"结尾的值
Column.like('%or%') 查找任何在任意位置包含"or"的值
Column.like('_r%') 查找任何第二个位置为"r"的值
Column.like('a_%_%') 查找任何以"a"开头且长度≥ 3的值
Column.like('a%o') 查找任何以"a"开头并以"o"结尾的值

Column.rlike(regex)

基于正则表达式匹配返回一个布尔列表达式,由字面量字符串或列提供:

df = df.filter(F.col('phone').rlike('[0-9]{3}(?:.+)?[0-9]{3}(?:.+)?[0-9]{4}'))
name phone
"Alice" "412-512-1234"
"John" "(555) 123-5123"
"Jane" "4121234444"

如果使用得当,正则表达式非常强大。以下是一些入门资源:

开头匹配、结尾匹配、包含匹配(Starts with, ends with, contains)

Column.startswith(string)

返回一个布尔列表达式,指示列的字符串值是否以参数中提供的字符串(字面量或其他列)开头。

df = df.filter(F.col("id").startswith("prefix-"))

Column.endswith(string)

返回一个布尔列表达式,指示列的字符串值是否以参数中提供的字符串(字面量或其他列)结尾。

df = df.filter(F.col("id").endswith("-suffix"))

Column.contains(string)

返回一个布尔列表达式,指示列的字符串值是否包含参数中提供的字符串(字面量或其他列)。

df = df.filter(F.col("id").contains("string"))

子字符串(Substring)

Column.substr(startPos, length)

返回一个字符串列表达式,用于求取列值的子字符串。

参数:

  • startPos - 起始位置,从1开始计数(int或Column)
  • length - 子字符串的长度(int或Column)

  • 创建子字符串列

df = df.select(F.col("name").substr(1, 3).alias("col"))
col
"Ali"
"Bob"
  1. 基于子字符串进行过滤
df = df.filter(F.col("phone").substr(5, 3) == "555")
phone
"323-555-1234"
"897-555-4126"
...

包含于(Is in)

Column.isin(*cols)

返回一个布尔表达式,如果列的值包含在参数(以参数列表或数组形式提供的Columns或Literals)的求值结果中,则求值为True

df = df.filter(F.col("name").isin("Bob", "Mike"))
age name
5 "Bob"
... ...
df = df.filter(F.col("age").isin([1, 2, 3]))
age name
2 "Alice"
... ...

区间判断(Between)

Column.between(lowerBound, upperBound)

返回一个布尔表达式,如果表达式的值介于lowerBound和upperBound字面量或列之间(包含边界),则求值为True

within_range = F.col("age").between(10, df.upperBound).alias("age_within_range")
df = df.select(df.name, df.upperBound, df.age, within_range)
name upperbound age age_within_range
"Taylor" 30 35 False
"Sally" 40 34 True
"Lucy" 28 28 True