Transforms¶
Python¶
Notional data generation¶
How can I create fake data in a transform?
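This code uses the synthesizer library to generate a random dataset from a column specification (ROW_SPEC) that names each column and the kind of value to generate for it, then adds a derived key column with PySpark.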
from transforms.api import transform_df, Output
from synthesizer import DataFrameTransformOrchestrator
import pyspark.sql.functions as F

'''
Transform to generate data from scratch.
Remember to import the "synthesizer" library into the repository first!
'''

# Column name -> generator name, optionally with the generator's arguments
ROW_SPEC = {
    "level_1": {
        "random_element": {
            "elements": list(range(0, 10))
        }
    },
    "level_2": {
        "random_element": {
            "elements": list(range(0, 20))
        }
    },
    "level_3": {
        "random_element": {
            "elements": list(range(0, 50))
        }
    },
    "level_4": {
        "random_element": {
            "elements": list(range(0, 100))
        }
    },
    "ThisIsFakeData": "first_name",
    "First_Name": "first_name",
    "Last_Name": "last_name",
    "Phone_Number": "phone_number",
    "Address_Contact": "address",
    "Birth_Date": {
        "date_time_between": {
            "start_date": "-80y",
            "end_date": "-3m"
        }
    },
    "Title": {
        "random_element": {
            "elements": ["Mr.", "Mrs.", "Ms.", "Dr."]
        }
    },
}


@transform_df(
    Output("/Palantir/awesome-foundry/[PipelineMocking] Dataset Generation/generated_data"),
)
def function_40(ctx):
    # Parameters
    ROW_COUNT = 10_000

    # Create a dataset generator
    orch = DataFrameTransformOrchestrator(
        ROW_SPEC,
        ROW_COUNT,
        ctx.spark_session
    )

    # Generate the dataset
    df = orch()

    # Add derived columns or apply further operations,
    # e.g. a key column concatenated from the four level columns
    df = df.withColumn("pk", F.concat_ws("__", "level_1", "level_2", "level_3", "level_4"))
    return df
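
One point worth checking about the "pk" column: it is concatenated from four randomly drawn level values, so two rows can end up with the same key. The sketch below is plain Python and does not use synthesizer or Spark. It assumes, without confirmation from the library, that each level value is drawn uniformly and independently; the helper names (key_space, expected_distinct, simulate_pks) are hypothetical and exist only for this illustration.

```python
import random

# Cardinalities of level_1..level_4 as declared in ROW_SPEC.
LEVEL_SIZES = [10, 20, 50, 100]
ROW_COUNT = 10_000


def key_space(sizes):
    """Number of distinct keys the level columns can form."""
    total = 1
    for size in sizes:
        total *= size
    return total


def expected_distinct(space, rows):
    """Expected distinct keys among `rows` draws (ASSUMES uniform, independent draws)."""
    return space * (1 - (1 - 1 / space) ** rows)


def simulate_pks(sizes, rows, seed=0):
    """Build keys the same way concat_ws("__", ...) joins the level columns."""
    rng = random.Random(seed)
    return ["__".join(str(rng.randrange(size)) for size in sizes) for _ in range(rows)]


space = key_space(LEVEL_SIZES)
pks = simulate_pks(LEVEL_SIZES, ROW_COUNT)
print("possible keys:", space)
print("expected distinct keys:", round(expected_distinct(space, ROW_COUNT)))
print("distinct keys in this simulated run:", len(set(pks)))
```

Under that assumption the four levels span 1,000,000 possible keys, so 10,000 rows are expected to contain a few dozen repeated keys. If downstream logic needs a strictly unique key, it may be safer to verify uniqueness on the generated dataset or to derive the key from a row identifier instead.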
- Date submitted: 2024-03-26
- Tags: code authoring, code repositories, data generation, unstructured, synthesizer