Concept: Queries¶
Distinct, drop duplicates¶
DataFrame.distinct()¶
Returns a new DataFrame containing the distinct rows in the originating DataFrame.
df = df.distinct()
DataFrame.drop_duplicates(subset=None)¶
Returns a new DataFrame with duplicate rows removed, optionally only considering certain columns.
df = df.drop_duplicates()
df = df.drop_duplicates(["firstname", "lastname"])
Drop null values¶
DataFrame.dropna(how='any', thresh=None, subset=None)¶
Alias: DataFrame.na.drop(how='any', thresh=None, subset=None)
Returns a new DataFrame omitting rows with null values. DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other.
Parameters:
- how – 'any' or 'all'.
  - If 'any', drop a row if it contains any nulls.
  - If 'all', drop a row only if all its values are null.
- thresh – integer, default None. If specified, drop rows that have fewer than thresh non-null values. (This overrides the how parameter.)
- subset – optional list of column names to consider.
Limit rows¶
DataFrame.limit(number)¶
Returns a new DataFrame containing at most number rows.
Sorting¶
DataFrame.sort(*cols, **kwargs)¶
Alias: DataFrame.orderBy(*cols, **kwargs)
Column.asc() or F.asc(col)
Column.desc() or F.desc(col)