
Code Repositories

Python

Create stable primary keys in PySpark

This example uses the PySpark functions concat_ws() and sha2() to build a stable primary key: it concatenates the columns that identify a row and, optionally, hashes the result. Because the key is derived only from the row's own values, rebuilding the dataset produces the same key for the same row.

Primary keys should be unique, non-null, and stable. Using non-deterministic or unstable primary keys, such as monotonically increasing IDs or random numbers, can lead to several issues:

  1. Object identification: Objects in the Ontology are identified by their primary keys. Edits to objects can be lost if each build of the backing dataset generates new sets of IDs.
  2. Change tracking: Tracking changes to datasets and rows becomes difficult if primary keys are not stable and reproducible.
  3. Pipeline consistency: When keys are unstable, incremental runs of the pipeline that produces them, or of downstream pipelines that depend on the dataset, can introduce duplicates and other inconsistencies.
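
The instability described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not Foundry or PySpark code: the sample rows and the helper names `positional_ids` and `hashed_ids` are hypothetical. It compares IDs assigned by row position (similar in spirit to monotonically increasing IDs) with IDs derived from row content, when the same rows arrive in a different order on a rebuild.

```python
import hashlib


def positional_ids(rows):
    """Assign IDs by row position; the result depends on arrival order."""
    return {row: i for i, row in enumerate(rows)}


def hashed_ids(rows):
    """Derive IDs from the row's identifying values only."""
    return {
        row: hashlib.sha256(":".join(row).encode("utf-8")).hexdigest()
        for row in rows
    }


# Hypothetical (student_id, date) rows from two builds of the same data.
build_1 = [("s1", "2024-09-01"), ("s2", "2024-09-01"), ("s1", "2024-09-02")]
build_2 = list(reversed(build_1))  # same rows, different order

# Compare the row -> ID mappings produced by each build.
positional_stable = positional_ids(build_1) == positional_ids(build_2)
hashed_stable = hashed_ids(build_1) == hashed_ids(build_2)

print("positional IDs stable across builds:", positional_stable)
print("content-derived IDs stable across builds:", hashed_stable)
```

In this sketch the positional mapping changes when the row order changes, while the content-derived mapping does not, which is the property that keeps edits and change tracking attached to the right rows.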

A good practice for creating primary keys is to use a combination of columns in the dataset that can uniquely identify each row. For example, in an attendance dataset, the combination of student ID and date columns can uniquely identify each row.
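
Before building a key from a column combination, it is worth checking that the combination really is unique and non-null. A minimal pure-Python sketch of that check follows; the attendance rows and the helper `find_key_violations` are hypothetical, and on a real Spark dataset the same check would typically be expressed as a group-by and count.

```python
from collections import Counter


def find_key_violations(rows, key_columns):
    """Return (duplicated key tuples, rows with a null key column)."""
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    counts = Counter(keys)
    # Key tuples that occur more than once cannot identify a single row.
    duplicates = [key for key, n in counts.items() if n > 1]
    # Rows missing any key column cannot be given a reliable key.
    null_rows = [
        row for row in rows if any(row[c] is None for c in key_columns)
    ]
    return duplicates, null_rows


# Hypothetical attendance data keyed by (student_id, date).
attendance = [
    {"student_id": "s1", "date": "2024-09-01", "present": True},
    {"student_id": "s2", "date": "2024-09-01", "present": False},
    {"student_id": "s1", "date": "2024-09-01", "present": True},  # duplicate key
    {"student_id": None, "date": "2024-09-02", "present": True},  # null key part
]

duplicates, null_rows = find_key_violations(attendance, ["student_id", "date"])
print("duplicated keys:", duplicates)
print("rows with null key columns:", len(null_rows))
```

If either list is non-empty, the chosen columns are not yet a usable primary key and another column may need to be added to the combination.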

Below is an example that builds a primary key for a dataset in which each row is uniquely identified by columns A, B, and C. concat_ws() joins the column values with a separator, and the optional sha2() step hashes the result so that every key has the same length and format.

from pyspark.sql import functions as F

# Concatenate the identifying columns, separated by ":"
df = df.withColumn("primary_key", F.concat_ws(":", "A", "B", "C"))

# Optionally create a hash to ensure the same length and format for all keys
df = df.withColumn("primary_key", F.sha2(F.col("primary_key"), 256))
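
One caveat with the snippet above: concat_ws() skips null values, and a separator that also appears inside a value can blur column boundaries, so distinct rows can produce the same concatenated string. The sketch below models this in plain Python. `concat_ws_model` and `safe_key` are hypothetical helpers written for illustration, not PySpark functions, and the null marker and length-prefix encoding are just one possible way to keep values unambiguous.

```python
import hashlib


def concat_ws_model(sep, *values):
    """Plain-Python model of concat_ws(): join values, skipping None."""
    return sep.join(str(v) for v in values if v is not None)


def safe_key(*values):
    """Hash an unambiguous encoding of the values.

    Each value is length-prefixed, and None gets its own marker,
    so neither nulls nor separators inside values can collide.
    """
    parts = []
    for v in values:
        if v is None:
            parts.append("N")
        else:
            text = str(v)
            parts.append(f"S{len(text)}:{text}")
    encoded = "|".join(parts)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


# A null in different positions yields the same naive string.
null_collision = (
    concat_ws_model(":", "a", None, "b") == concat_ws_model(":", "a", "b", None)
)
# A separator inside a value yields the same naive string.
separator_collision = (
    concat_ws_model(":", "a:b", "c") == concat_ws_model(":", "a", "b:c")
)

print("naive null collision:", null_collision)
print("naive separator collision:", separator_collision)
print("safe keys differ:", safe_key("a", None, "b") != safe_key("a", "b", None))
```

If the identifying columns are guaranteed non-null and never contain the separator, the simpler snippet above is sufficient. Otherwise, in PySpark the null case can be handled by wrapping each column in F.coalesce() with a placeholder literal before calling concat_ws().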

Keys built this way stay the same across builds as long as the identifying columns do, which keeps them reliable for identifying rows in your datasets.

  • Language: Python
  • Date submitted: 2024-09-16
  • Tags: pyspark, dataframe
