Read files in a repository

You can read other files from your repository into the transform context. This is useful for keeping parameters that your transform code references in a separate config file.

To start, edit setup.py in your Python repository:

setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        # The '' key applies these patterns to every package
        '': ['*.yaml', '*.csv']
    }
)

This tells Python to bundle the YAML and CSV files into the package. To bundle another file type, such as txt, add a matching pattern (for example '*.txt') to package_data. Then place a config file (for example config.yaml) next to your Python transform (for example read_yml.py, shown below):

- name: tbl1
  primaryKey:
  - col1
  - col2
  update:
  - column: col3
    with: 'XXX'
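
As a quick sanity check on the package_data patterns from setup.py, the sketch below tests which file names those patterns would pick up. It only approximates the matching with the standard library's fnmatch module (setuptools does its own glob matching), and the file names lookup.csv and notes.txt are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the package_data entry in setup.py.
PACKAGE_DATA_PATTERNS = ["*.yaml", "*.csv"]


def is_bundled(filename, patterns=PACKAGE_DATA_PATTERNS):
    """Return True if the file name matches at least one pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


# Hypothetical file names placed next to the transform.
candidates = ["config.yaml", "lookup.csv", "notes.txt"]
results = {name: is_bundled(name) for name in candidates}
print(results)
# {'config.yaml': True, 'lookup.csv': True, 'notes.txt': False}
```

A txt file is not matched until a pattern such as '*.txt' is added to the list.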

You can read the config file in your transform read_yml.py with the code below:

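from transforms.api import transform_df, Output
from pkg_resources import resource_stream
import yaml
import json


@transform_df(
    Output("/Demo/read_yml")
)
def my_compute_function(ctx):
    # Open config.yaml, which sits next to this module in the package
    stream = resource_stream(__name__, "config.yaml")
    # Parse the YAML into Python lists and dicts
    docs = yaml.safe_load(stream)
    # Emit one row whose "result" column holds the config as a JSON string
    return ctx.spark_session.createDataFrame([{'result': json.dumps(docs)}])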
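
Recent setuptools releases deprecate pkg_resources, so an alternative may be worth knowing. The following is a minimal sketch, not the documented approach: it reads a bundled CSV file with the standard library's importlib.resources, which needs Python 3.9 or later for resources.files. Whether your transforms environment meets that requirement is an assumption. The package demo_pkg, the file config.csv, and the helper read_packaged_csv are hypothetical and exist only so the sketch runs on its own.

```python
import csv
import importlib
import io
import sys
import tempfile
from importlib import resources
from pathlib import Path


def read_packaged_csv(package, filename):
    """Read a CSV file bundled inside `package` into a list of dict rows."""
    text = resources.files(package).joinpath(filename).read_text(encoding="utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Build a throwaway package on disk so the sketch is self-contained.
tmp_root = tempfile.mkdtemp()
pkg_dir = Path(tmp_root) / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("", encoding="utf-8")
(pkg_dir / "config.csv").write_text("column,with\ncol3,XXX\n", encoding="utf-8")

# Make the throwaway package importable.
sys.path.insert(0, tmp_root)
importlib.invalidate_caches()

rows = read_packaged_csv("demo_pkg", "config.csv")
print(rows)
# [{'column': 'col3', 'with': 'XXX'}]
```

Inside a transform module you would pass the module's own package, for example read_packaged_csv(__package__, "config.csv"), instead of the hypothetical demo_pkg.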

So your project structure would be:

some_folder
    config.yaml
    read_yml.py

This outputs a single row in your dataset with one column, "result", holding the config serialized as JSON (the key order can vary with the Python version):

[{"primaryKey": ["col1", "col2"], "update": [{"column": "col3", "with": "XXX"}], "name": "tbl1"}]
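
A single JSON string is easy to produce, but it is awkward to query downstream. As one possible variation, the sketch below flattens the parsed config into one flat record per update rule. The docs literal stands in for what yaml.safe_load returns for the config.yaml above. The helper config_to_rows and its output field names table and primary_key are hypothetical.

```python
import json

# Stand-in for the result of yaml.safe_load on the example config.yaml.
docs = [
    {
        "name": "tbl1",
        "primaryKey": ["col1", "col2"],
        "update": [{"column": "col3", "with": "XXX"}],
    }
]


def config_to_rows(docs):
    """Flatten the config into one flat dict per (table, update rule)."""
    rows = []
    for table in docs:
        for rule in table.get("update", []):
            rows.append({
                "table": table["name"],
                # Join the key columns so every field is a plain string.
                "primary_key": ",".join(table.get("primaryKey", [])),
                "column": rule["column"],
                "with": rule["with"],
            })
    return rows


rows = config_to_rows(docs)
print(json.dumps(rows))
# [{"table": "tbl1", "primary_key": "col1,col2", "column": "col3", "with": "XXX"}]
```

In the transform, a list like rows could be passed to ctx.spark_session.createDataFrame in place of the single-row list, giving one dataset row per update rule.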
