
Concept: Queries

Distinct, drop duplicates

DataFrame.distinct()

Returns a new DataFrame containing the distinct rows in the originating DataFrame.

df = df.distinct()

DataFrame.drop_duplicates(subset=None)

Returns a new DataFrame with duplicate rows removed, optionally only considering certain columns.

df = df.drop_duplicates()
df = df.drop_duplicates(["firstname", "lastname"])

Drop null values

DataFrame.dropna(how='any', thresh=None, subset=None)

Alias: DataFrame.na.drop(how='any', thresh=None, subset=None)

Returns a new DataFrame omitting rows with null values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other.

Parameters:

  • how – 'any' or 'all'.
      • If 'any', drop a row if it contains any nulls.
      • If 'all', drop a row only if all its values are null.
  • thresh – integer, default None. If specified, drop rows that have fewer than thresh non-null values. (This overrides the how parameter.)
  • subset – optional list of column names to consider.

Limit rows

DataFrame.limit(num)

Returns a new DataFrame containing at most num rows.

Sorting

DataFrame.sort(*cols, **kwargs)

Alias: DataFrame.orderBy(*cols, **kwargs)

  • Column.asc() or F.asc(col)
  • Column.desc() or F.desc(col)
