Transforms

Python

Notional data generation

How can I create fake data in a transform?

The following transform uses the synthesizer library to generate a random dataset with the columns and value types defined in ROW_SPEC, then adds a composite key column built from four of them.

from transforms.api import transform_df, Output
from synthesizer import DataFrameTransformOrchestrator
import pyspark.sql.functions as F


# Transform that generates a dataset from scratch.
# Remember to import the "synthesizer" library!

ROW_SPEC = {
    "level_1": {
        "random_element": {
            "elements": list(range(0, 10))
        }
    },
    "level_2": {
        "random_element": {
            "elements": list(range(0, 20))
        }
    },
    "level_3": {
        "random_element": {
            "elements": list(range(0, 50))
        }
    },
    "level_4": {
        "random_element": {
            "elements": list(range(0, 100))
        }
    },
    "ThisIsFakeData": "first_name",
    "First_Name": "first_name",
    "Last_Name": "last_name",
    "Phone_Number": "phone_number",
    "Address_Contact": "address",
    "Birth_Date": {
        "date_time_between": {
            "start_date": "-80y",
            "end_date": "-3m"
        }
    },
    "Title": {
        "random_element": {
            "elements": ["Mr.", "Mrs.", "Ms.", "Dr."]
        }
    },
}


@transform_df(
    Output("/Palantir/awesome-foundry/[PipelineMocking] Dataset Generation/generated_data"),
)
def function_40(ctx):
    # Parameters
    ROW_COUNT = 10_000

    # Create a dataset generator
    orch = DataFrameTransformOrchestrator(
        ROW_SPEC,
        ROW_COUNT,
        ctx.spark_session
    )

    # Generate the dataset
    df = orch()

    # Add derived columns or apply further operations here,
    # e.g. a composite key built from the four level columns
    df = df.withColumn("pk", F.concat_ws("__", "level_1", "level_2", "level_3", "level_4"))

    return df
  • Date submitted: 2024-03-26
  • Tags: code authoring, code repositories, data generation, unstructured, synthesizer
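
The "pk" column above is a concatenation of four randomly drawn values, so it is worth checking whether it behaves like a primary key. The sketch below is a plain-Python simulation, not the synthesizer library itself: it assumes each level is drawn uniformly and independently (the source does not state how synthesizer samples "random_element"), and the helper names simulate_pks and expected_distinct are hypothetical. With 10 × 20 × 50 × 100 = 1,000,000 possible keys and 10,000 rows, some repeated keys should be expected.

```python
import random
from collections import Counter

# Cardinalities of level_1..level_4, copied from ROW_SPEC above.
LEVEL_SIZES = [10, 20, 50, 100]
ROW_COUNT = 10_000


def simulate_pks(row_count, level_sizes, seed=0):
    """Draw each level uniformly at random and join the values with "__".

    Assumption: uniform, independent draws. This mimics the concat_ws call
    in the transform, not the synthesizer library's actual sampling.
    """
    rng = random.Random(seed)
    return [
        "__".join(str(rng.randrange(size)) for size in level_sizes)
        for _ in range(row_count)
    ]


def expected_distinct(row_count, level_sizes):
    """Expected number of distinct keys under uniform, independent draws."""
    key_space = 1
    for size in level_sizes:
        key_space *= size
    return key_space * (1 - (1 - 1 / key_space) ** row_count)


pks = simulate_pks(ROW_COUNT, LEVEL_SIZES)
counts = Counter(pks)
# Rows whose key value also appears on at least one other row.
duplicated_rows = sum(c for c in counts.values() if c > 1)

print(f"rows: {len(pks)}, distinct pk values: {len(counts)}")
print(f"rows sharing a pk with another row: {duplicated_rows}")
print(f"expected distinct pk values: {expected_distinct(ROW_COUNT, LEVEL_SIZES):.0f}")
```

If downstream logic needs a truly unique key, one option is to deduplicate in the transform with df.dropDuplicates(["pk"]), or to derive the key from F.monotonically_increasing_id() instead of the random level columns.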
